Data Preparation Guide

Each sub-dataset (delivery, pickup) contains five CSV files, one per city. The details of each city can be found in the following dataset.

To merge the five city files of the delivery sub-dataset into a single dataset for preprocessing, follow these steps:

Data source

Steps to Merge Datasets

1. Check Column Consistency

  • Ensure all datasets have identical column structures, except for the city-specific column, which should contain the city name.
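One way to run this check programmatically is to compare each file's column set against the first file's, ignoring the city column. This is a minimal sketch, not a prescribed method: the function name column_mismatches and the toy column names (order_id, lng, lat) are assumptions made for illustration.

```python
import pandas as pd

def column_mismatches(frames, ignore=("city",)):
    """Compare every frame's columns against the first frame's.

    Returns {name: (missing, extra)} for frames that differ; an empty
    dict means all column sets match (ignoring the `ignore` columns).
    """
    names = list(frames)
    reference = set(frames[names[0]].columns) - set(ignore)
    report = {}
    for name in names[1:]:
        cols = set(frames[name].columns) - set(ignore)
        missing = sorted(reference - cols)  # in the reference, absent here
        extra = sorted(cols - reference)    # present here, not in the reference
        if missing or extra:
            report[name] = (missing, extra)
    return report

# Toy frames standing in for two city files (hypothetical column names)
frames = {
    "city1": pd.DataFrame({"order_id": [1], "lng": [120.1], "lat": [30.2]}),
    "city2": pd.DataFrame({"order_id": [2], "lng": [121.4]}),
}
print(column_mismatches(frames))  # {'city2': (['lat'], [])}
```

An empty result means the files can be concatenated without introducing columns that are entirely missing for some cities.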

2. Add a City Column (if missing)

  • If the city name is not included as a column in each file, add a column for it before merging. Do this once the file has been loaded into a DataFrame (steps 3 and 4).

  • Example for one file, where df is the DataFrame loaded from that file:

    df['city'] = 'CityName'

3. Load All Datasets

  • Load the five datasets using a library like pandas.

  • Example:

    import pandas as pd
    
    # Load datasets for each city
    df_city1 = pd.read_csv('city1.csv')
    df_city2 = pd.read_csv('city2.csv')
    df_city3 = pd.read_csv('city3.csv')
    df_city4 = pd.read_csv('city4.csv')
    df_city5 = pd.read_csv('city5.csv')

4. Add City Names (if needed)

  • Add the city name as a new column to each dataset:

    df_city1['city'] = 'City1'
    df_city2['city'] = 'City2'
    df_city3['city'] = 'City3'
    df_city4['city'] = 'City4'
    df_city5['city'] = 'City5'
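Steps 3 and 4 can also be written as a single loop, which avoids repeating one line per city. The sketch below assumes a mapping from city name to CSV path. The helper name load_city_frames is hypothetical, and the paths in the usage comment are the placeholder file names from step 3.

```python
from pathlib import Path

import pandas as pd

def load_city_frames(paths):
    """Read each CSV and tag its rows with the city name.

    `paths` maps a city name to the path of that city's CSV file.
    A file that already has a 'city' column is left unchanged.
    """
    frames = []
    for city, path in paths.items():
        df = pd.read_csv(Path(path))
        if "city" not in df.columns:
            df = df.assign(city=city)  # returns a new frame with the column added
        frames.append(df)
    return frames

# Usage with the placeholder file names from step 3:
# frames = load_city_frames({"City1": "city1.csv", "City2": "city2.csv"})
```

The returned list can be passed directly to pd.concat() in the next step.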

5. Combine Datasets

  • Use pd.concat() to merge all datasets into a single DataFrame:

    merged_df = pd.concat([df_city1, df_city2, df_city3, df_city4, df_city5], ignore_index=True)

6. Verify the Merged Dataset

  • Check the merged dataset for inconsistencies or anomalies:

    • Confirm the total row count matches the sum of all rows across the individual files.

    • Ensure the city column contains the correct city names.

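    merged_df.info()  # info() prints its own summary and returns None
    print(len(merged_df))  # compare with the sum of the per-city row counts
    print(merged_df['city'].value_counts())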
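The two checks listed above can also be automated rather than read off the printed output. The following is a minimal sketch, assuming each per-city DataFrame already carries a city column. The function name verify_merge and the toy data are illustrative only.

```python
import pandas as pd

def verify_merge(frames, merged, city_col="city"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Check 1: total rows must equal the sum of the per-city row counts.
    expected_rows = sum(len(df) for df in frames)
    if len(merged) != expected_rows:
        problems.append(f"row count {len(merged)} != expected {expected_rows}")

    # Check 2: the city column must be filled in, with the expected count per city.
    if merged[city_col].isna().any():
        problems.append(f"missing values in '{city_col}'")
    expected = pd.concat([df[city_col] for df in frames]).value_counts()
    actual = merged[city_col].value_counts()
    for city, n in expected.items():
        got = int(actual.get(city, 0))
        if got != n:
            problems.append(f"{city}: {got} rows, expected {n}")
    return problems

# Toy data standing in for two city frames
frames = [
    pd.DataFrame({"x": [1, 2], "city": ["City1", "City1"]}),
    pd.DataFrame({"x": [3], "city": ["City2"]}),
]
merged = pd.concat(frames, ignore_index=True)
print(verify_merge(frames, merged))  # prints []
```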

7. Save the Combined Dataset

  • Save the merged dataset for subsequent preprocessing:

    merged_df.to_csv('merged_city_data.csv', index=False)

Post-Merge Validation

After merging:

  1. Check for Duplicates:

    • Remove any exact duplicate rows from the merged dataset:

      merged_df = merged_df.drop_duplicates()

  2. Handle Missing Values:

    • Reassess missing values now that all data is combined.

  3. Column Alignment:

    • Ensure all columns are correctly formatted and ready for preprocessing.
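The three post-merge checks can be summarised in one pass before deciding how to clean the data. This is a rough sketch under the assumption that exact-row duplicates are the duplicates of interest. The function name post_merge_report and the toy columns are hypothetical.

```python
import pandas as pd

def post_merge_report(df):
    """Summarise duplicates, missing values, and column types for a merged frame."""
    return {
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of missing values in each column
        "missing_per_column": {col: int(n) for col, n in df.isna().sum().items()},
        # Data type of each column, as a string, to spot misaligned columns
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }

# Toy merged frame (hypothetical columns)
toy = pd.DataFrame({"order_id": [1, 1, 2], "city": ["City1", "City1", None]})
report = post_merge_report(toy)
print(report["duplicate_rows"])
print(report["missing_per_column"])
```

The report only describes the data. Whether to drop, fill, or keep the affected rows remains a preprocessing decision.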


By following this approach, you’ll have a unified dataset containing all five cities' data, ready for preprocessing and further analysis.
