Data Preparation Guide

Each sub-dataset (delivery, pickup) contains five CSV files, one per city. Details of each city can be found in the following dataset.

To merge the five city datasets of delivery into a single dataset for preprocessing, follow these steps:

Steps to Merge Datasets

1. Check Column Consistency

Ensure all five files have the same columns, with the same names and data types. The only per-file difference should be the value of the city column, which holds the city name.
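
One way to run this check is to compare each file's columns against the first file's columns. The sketch below is a minimal illustration, not part of the original guide: the helper name `column_mismatches` and the sample column names (`order_id`, `lng`, `lat`) are hypothetical, and small in-memory DataFrames stand in for the real files.

```python
import pandas as pd


def column_mismatches(frames, ignore=("city",)):
    """Compare every frame's columns against those of the first frame.

    Returns {name: {"missing": [...], "extra": [...]}} for each frame
    whose columns differ from the reference. Columns listed in `ignore`
    are skipped, since the city column may not exist in every file yet.
    """
    names = list(frames)
    reference = [c for c in frames[names[0]].columns if c not in ignore]
    report = {}
    for name in names[1:]:
        cols = [c for c in frames[name].columns if c not in ignore]
        missing = [c for c in reference if c not in cols]
        extra = [c for c in cols if c not in reference]
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report


# Hypothetical stand-ins for the loaded city files.
frames = {
    "City1": pd.DataFrame({"order_id": [1], "lng": [0.0], "lat": [0.0]}),
    "City2": pd.DataFrame({"order_id": [2], "lng": [0.0], "lat": [0.0], "city": ["City2"]}),
    "City3": pd.DataFrame({"order_id": [3], "lng": [0.0]}),  # lacks "lat"
}

mismatches = column_mismatches(frames)
print(mismatches)
```

An empty result means every file matches the first one. For large files, reading only the header with `pd.read_csv(path, nrows=0)` is enough to get the column names.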

2. Add a City Column (if missing)

If a file does not already include the city name as a column, add one before merging.
Example for one file, assuming it has already been loaded into a DataFrame named df:
```
df['city'] = 'CityName'
```

3. Load All Datasets

Load the five datasets using a library like pandas.

Example:

```
import pandas as pd

# Load datasets for each city
df_city1 = pd.read_csv('city1.csv')
df_city2 = pd.read_csv('city2.csv')
df_city3 = pd.read_csv('city3.csv')
df_city4 = pd.read_csv('city4.csv')
df_city5 = pd.read_csv('city5.csv')
```

4. Add City Names (if needed)

Add the city name as a new column to each dataset:

```
df_city1['city'] = 'City1'
df_city2['city'] = 'City2'
df_city3['city'] = 'City3'
df_city4['city'] = 'City4'
df_city5['city'] = 'City5'
```

5. Combine Datasets

Use pd.concat() to merge all datasets into a single DataFrame:

```
merged_df = pd.concat([df_city1, df_city2, df_city3, df_city4, df_city5], ignore_index=True)
```
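
The load, tag, and combine steps can also be written as a single loop, which avoids repeating one line per city. This is a sketch under stated assumptions rather than the guide's prescribed method: the `sources` mapping and its sample CSV contents are hypothetical, and `io.StringIO` objects stand in for the real file paths so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the city files. In practice, map each city
# name to its CSV path instead, e.g. {"City1": "city1.csv", ...}.
sources = {
    "City1": io.StringIO("order_id,lng,lat\n1,0.1,0.2\n2,0.3,0.4\n"),
    "City2": io.StringIO("order_id,lng,lat\n3,0.5,0.6\n"),
}

frames = []
for city, source in sources.items():
    df = pd.read_csv(source)
    # Only add the city column when the file does not already carry one.
    if "city" not in df.columns:
        df = df.assign(city=city)
    frames.append(df)

# ignore_index=True gives the merged frame a fresh 0..n-1 index.
merged_df = pd.concat(frames, ignore_index=True)
print(merged_df)
```

Keeping the city-to-file mapping in one dictionary means adding or removing a city is a one-line change.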

6. Verify the Merged Dataset

Check the merged dataset for inconsistencies or anomalies:
- Confirm the total row count matches the sum of all rows across the individual files.
- Ensure the city column contains the correct city names.

```
# info() prints its summary directly and returns None, so no print() is needed
merged_df.info()
print(merged_df['city'].value_counts())
```
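
The two checks listed above can be turned into explicit comparisons instead of being read off the printed output. The following is a minimal sketch: the helper name `verify_merge` and the sample data are hypothetical, and it assumes the per-city DataFrames are still available in a dictionary keyed by city name.

```python
import pandas as pd


def verify_merge(frames, merged, city_col="city"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Check 1: total rows must equal the sum over the individual frames.
    expected_rows = sum(len(df) for df in frames.values())
    if len(merged) != expected_rows:
        problems.append(f"row count {len(merged)} != expected {expected_rows}")

    # Check 2: each city must contribute exactly as many rows as its source.
    counts = merged[city_col].value_counts()
    for city, df in frames.items():
        found = int(counts.get(city, 0))
        if found != len(df):
            problems.append(f"{city}: {found} rows in merged, {len(df)} in source")

    # Check 3: no city labels other than the expected ones.
    unexpected = set(counts.index) - set(frames)
    if unexpected:
        problems.append(f"unexpected city values: {sorted(unexpected)}")
    return problems


# Hypothetical sample data standing in for the real city frames.
frames = {
    "City1": pd.DataFrame({"order_id": [1, 2], "city": "City1"}),
    "City2": pd.DataFrame({"order_id": [3], "city": "City2"}),
}
merged = pd.concat(list(frames.values()), ignore_index=True)

print(verify_merge(frames, merged))            # clean merge
print(verify_merge(frames, merged.iloc[:-1]))  # merge with a row missing
```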

7. Save the Combined Dataset

Save the merged dataset for subsequent preprocessing:
```
merged_df.to_csv('merged_city_data.csv', index=False)
```

Post-Merge Validation

After merging:

Check for Duplicates:
- Remove any duplicate entries from the merged dataset:
  merged_df = merged_df.drop_duplicates()

Handle Missing Values:
- Reassess missing values now that all data is combined.

Column Alignment:
- Ensure all columns are correctly formatted and ready for preprocessing.
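
The three checks above could look like this in pandas. This is an illustrative sketch only: the sample frame and its column names (`order_id`, `accept_time`) are hypothetical, and converting a timestamp column with `pd.to_datetime` is just one example of what "correctly formatted" might mean for a given column.

```python
import pandas as pd

# Hypothetical merged data with one duplicated row and one missing value.
merged_df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "accept_time": ["2022-05-01 08:00", "2022-05-01 09:30", "2022-05-01 09:30", None],
    "city": ["City1", "City1", "City1", "City2"],
})

# Duplicates: count fully identical rows first, then drop them.
n_duplicates = int(merged_df.duplicated().sum())
merged_df = merged_df.drop_duplicates().reset_index(drop=True)

# Missing values: totals per column, and a per-city breakdown.
missing_per_column = merged_df.isna().sum()
missing_by_city = merged_df.drop(columns="city").isna().groupby(merged_df["city"]).sum()

# Column alignment: inspect dtypes and convert where needed.
# errors="coerce" turns unparseable values into NaT instead of raising.
merged_df["accept_time"] = pd.to_datetime(merged_df["accept_time"], errors="coerce")

print(f"duplicates removed: {n_duplicates}")
print(missing_per_column)
print(missing_by_city)
print(merged_df.dtypes)
```

Counting duplicates before dropping them shows how much data is affected, and the per-city breakdown helps reveal whether missing values are concentrated in one source file.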

By following this approach, you’ll have a unified dataset containing all five cities' data, ready for preprocessing and further analysis.

