Data Preparation Guide
Each sub-dataset (delivery, pickup) contains five CSV files, one per city; details of each city can be found in the following dataset.
To merge the five delivery city datasets into a single dataset for preprocessing, follow these steps:

Steps to Merge Datasets
1. Check Column Consistency
Ensure all datasets have identical column structures, except for the city-specific column, which should contain the city name.
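The check in step 1 matters because pd.concat() aligns columns by name and fills any column missing from a file with NaN rather than raising an error. A minimal sketch of a header comparison follows; it reads only the first row of each file using the standard csv module. The helper names, file names, and column names (order_id, lat, lng) are hypothetical and only serve the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


def find_column_mismatches(paths, ignore=("city",)):
    """Compare each file's columns to the first file's, skipping `ignore`.

    Returns {path: (missing_columns, extra_columns)} for files that differ.
    Column order is not compared, only column names.
    """
    headers = {p: [c for c in read_header(p) if c not in ignore] for p in paths}
    reference = set(headers[paths[0]])
    mismatches = {}
    for path, columns in headers.items():
        missing = sorted(reference - set(columns))
        extra = sorted(set(columns) - reference)
        if missing or extra:
            mismatches[path] = (missing, extra)
    return mismatches


# Demonstration with two made-up files; the second one lacks 'lng'.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "city1.csv"
    bad = Path(tmp) / "city2.csv"
    good.write_text("order_id,lat,lng\n1,30.1,120.2\n", encoding="utf-8")
    bad.write_text("order_id,lat,city\n2,31.0,City2\n", encoding="utf-8")
    demo_mismatches = find_column_mismatches([str(good), str(bad)])
    demo_result = {Path(p).name: diff for p, diff in demo_mismatches.items()}
```

An empty result means every file shares the first file's column names, apart from the ignored city column.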
2. Add a City Column (if missing)
If the city name is not included as a column in each file, add a column explicitly for the city name before merging.
Example for one file:
df['city'] = 'CityName'
3. Load All Datasets
Load the five datasets using a library like pandas. Example:
import pandas as pd

# Load datasets for each city
df_city1 = pd.read_csv('city1.csv')
df_city2 = pd.read_csv('city2.csv')
df_city3 = pd.read_csv('city3.csv')
df_city4 = pd.read_csv('city4.csv')
df_city5 = pd.read_csv('city5.csv')
4. Add City Names (if needed)
Add the city name as a new column to each dataset:
df_city1['city'] = 'City1'
df_city2['city'] = 'City2'
df_city3['city'] = 'City3'
df_city4['city'] = 'City4'
df_city5['city'] = 'City5'
5. Combine Datasets
Use pd.concat() to merge all datasets into a single DataFrame:
merged_df = pd.concat([df_city1, df_city2, df_city3, df_city4, df_city5], ignore_index=True)
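Steps 3 to 5 repeat the same operation for each city, so they can also be written as a loop over a city-to-file mapping. The following is a sketch under that assumption, not the only way to do it; load_and_merge is a hypothetical helper, and the in-memory CSV text stands in for the real files.

```python
import io

import pandas as pd


def load_and_merge(sources):
    """Read one CSV per city, tag each row with its city, and concatenate.

    `sources` maps a city name to anything pd.read_csv accepts
    (a file path or a file-like object).
    """
    frames = []
    for city, source in sources.items():
        frame = pd.read_csv(source)
        frame["city"] = city  # overwrites an existing 'city' column, if any
        frames.append(frame)
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Demonstration with in-memory CSV text and made-up columns.
demo_sources = {
    "City1": io.StringIO("order_id,lat\n1,30.1\n2,30.2\n"),
    "City2": io.StringIO("order_id,lat\n3,31.0\n"),
}
merged_demo = load_and_merge(demo_sources)
```

With real files, the mapping would hold paths instead, for example {'City1': 'city1.csv', 'City2': 'city2.csv'}.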
6. Verify the Merged Dataset
Check the merged dataset for inconsistencies or anomalies:
Confirm the total row count matches the sum of all rows across the individual files.
Ensure the city column contains the correct city names.
print(merged_df.info())
print(merged_df['city'].value_counts())
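The two printed summaries still have to be compared with the source files by eye. One possible way to turn both checks into assertions is sketched below; verify_merge is a hypothetical helper, and it assumes the per-city DataFrames are still in memory, keyed by city name.

```python
import pandas as pd


def verify_merge(frames_by_city, merged):
    """Raise AssertionError if the merged frame disagrees with its inputs."""
    # Check 1: total rows equal the sum of rows across the individual frames.
    expected_total = sum(len(frame) for frame in frames_by_city.values())
    assert len(merged) == expected_total, (
        f"expected {expected_total} rows, found {len(merged)}"
    )
    # Check 2: each city label appears exactly as often as its source has rows.
    counts = merged["city"].value_counts()
    for city, frame in frames_by_city.items():
        found = int(counts.get(city, 0))
        assert found == len(frame), f"{city}: expected {len(frame)}, found {found}"
    # Check 3: no label appears that was not among the inputs.
    unexpected = set(counts.index) - set(frames_by_city)
    assert not unexpected, f"unexpected city labels: {sorted(unexpected)}"


# Demonstration with two small made-up frames.
demo_frames = {
    "City1": pd.DataFrame({"order_id": [1, 2], "city": "City1"}),
    "City2": pd.DataFrame({"order_id": [3], "city": "City2"}),
}
demo_merged = pd.concat(list(demo_frames.values()), ignore_index=True)
verify_merge(demo_frames, demo_merged)  # passes silently
```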
7. Save the Combined Dataset
Save the merged dataset for subsequent preprocessing:
merged_df.to_csv('merged_city_data.csv', index=False)
Post-Merge Validation
After merging:
Check for Duplicates:
Ensure no duplicate entries exist in the merged dataset:
merged_df = merged_df.drop_duplicates()
Handle Missing Values:
Reassess missing values now that all data is combined.
Column Alignment:
Ensure all columns are correctly formatted and ready for preprocessing.
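The three checks above can be gathered into one summary before any rows are dropped, so the number of rows that deduplication removes is known in advance. This is a minimal sketch; post_merge_report is a hypothetical helper and the sample data is made up.

```python
import pandas as pd


def post_merge_report(merged):
    """Summarise duplicates, missing values, and column dtypes."""
    return {
        # rows that exactly repeat an earlier row
        "duplicate_rows": int(merged.duplicated().sum()),
        # number of missing values in each column
        "missing_by_column": merged.isna().sum().to_dict(),
        # dtype of each column, as a string
        "dtypes": merged.dtypes.astype(str).to_dict(),
    }


# Demonstration: one repeated row and one missing latitude.
demo = pd.DataFrame(
    {
        "order_id": [1, 1, 2],
        "lat": [30.1, 30.1, None],
        "city": ["City1", "City1", "City2"],
    }
)
report = post_merge_report(demo)
deduplicated = demo.drop_duplicates().reset_index(drop=True)
```

Because the city column is part of each row, drop_duplicates() only removes rows that repeat within the same city; identical records from two different cities are both kept.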
By following this approach, you’ll have a unified dataset containing all five cities' data, ready for preprocessing and further analysis.