Data Preparation Guide

Each sub-dataset (delivery, pickup) contains 5 CSV files, one per city. Details for each city can be found in the data source.

To merge the five city files of the delivery sub-dataset into a single dataset for preprocessing, follow these steps:

Data source

Steps to Merge Datasets

1. Check Column Consistency

  • Ensure all datasets have identical column structures, except for the city-specific column, which should contain the city name.
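
A minimal sketch of such a consistency check, assuming each file has already been read into a pandas DataFrame. The helper name `column_mismatches`, the labels `city_a` / `city_b`, and the column names are placeholders for illustration, not names taken from the dataset.

```python
import pandas as pd

def column_mismatches(frames, ignore=("city",)):
    """Compare every DataFrame's columns against the first one.

    `frames` maps a label (for example a city name) to a DataFrame.
    Columns listed in `ignore` are skipped. Returns a dict
    {label: (missing_columns, extra_columns)} for each frame that differs.
    """
    ignored = set(ignore)
    labels = list(frames)
    reference = set(frames[labels[0]].columns) - ignored
    problems = {}
    for label in labels[1:]:
        cols = set(frames[label].columns) - ignored
        if cols != reference:
            problems[label] = (sorted(reference - cols), sorted(cols - reference))
    return problems

# Tiny in-memory stand-ins for two city files (hypothetical columns).
frames = {
    "city_a": pd.DataFrame({"order_id": [1], "lng": [120.1]}),
    "city_b": pd.DataFrame({"order_id": [2], "lat": [30.2]}),
}
mismatches = column_mismatches(frames)
print(mismatches)
```

An empty result means the files line up. For large files, reading only the header with `pd.read_csv(path, nrows=0)` is usually enough for this check.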

2. Add a City Column (if missing)

  • If the city name is not included as a column in each file, add a column explicitly for the city name before merging.

  • Example for one file:
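
One way this might look for a single file, shown on a small in-memory DataFrame. The column name `city`, the label `city_a`, and the helper `add_city_column` are assumptions made for the example.

```python
import pandas as pd

def add_city_column(df, city):
    """Return `df` with a constant `city` column.

    If a `city` column is already present, the frame is returned unchanged.
    """
    if "city" in df.columns:
        return df
    return df.assign(city=city)

# Stand-in for one city's file; in practice this would come from pd.read_csv.
df_one = pd.DataFrame({"order_id": [1, 2]})
df_one = add_city_column(df_one, "city_a")
print(df_one)
```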

3. Load All Datasets

  • Load the five datasets using a library like pandas.

  • Example:
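
A sketch of the loading step, assuming one CSV per city. The city labels and contents are placeholders; in real use the dictionary values would be file paths (for instance a hypothetical `delivery_city_a.csv`), and in-memory buffers stand in for them here so the example runs without the data.

```python
import io
import pandas as pd

def load_city_files(sources):
    """Read each CSV source into a DataFrame, keyed by city label.

    `sources` maps a city label to anything pd.read_csv accepts
    (a file path or a file-like object).
    """
    return {city: pd.read_csv(src) for city, src in sources.items()}

# In-memory buffers replace the real files (hypothetical columns and values).
sources = {
    "city_a": io.StringIO("order_id,lng\n1,120.1\n2,120.2\n"),
    "city_b": io.StringIO("order_id,lng\n3,121.5\n"),
}
frames = load_city_files(sources)
for city, df in frames.items():
    print(city, df.shape)
```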

4. Add City Names (if needed)

  • Add the city name as a new column to each dataset:
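
With the files held in a dictionary keyed by city, the labelling can be sketched in one pass. The `city` column name and the labels are assumed for illustration.

```python
import pandas as pd

# Stand-ins for the loaded city DataFrames (hypothetical labels and columns).
frames = {
    "city_a": pd.DataFrame({"order_id": [1, 2]}),
    "city_b": pd.DataFrame({"order_id": [3]}),
}

# assign() returns a new DataFrame, so the originals are left untouched.
labelled = {city: df.assign(city=city) for city, df in frames.items()}
print(labelled["city_b"])
```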

5. Combine Datasets

  • Use pd.concat() to merge all datasets into a single DataFrame:
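
A minimal sketch of the concatenation, using two small labelled frames in place of the five city files. `ignore_index=True` gives the result a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
import pandas as pd

# Stand-ins for the labelled city DataFrames (hypothetical values).
labelled = [
    pd.DataFrame({"order_id": [1, 2], "city": ["city_a", "city_a"]}),
    pd.DataFrame({"order_id": [3], "city": ["city_b"]}),
]

# Stack the frames row-wise into one DataFrame.
merged = pd.concat(labelled, ignore_index=True)
print(merged)
```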

6. Verify the Merged Dataset

  • Check the merged dataset for inconsistencies or anomalies:

    • Confirm the total row count matches the sum of all rows across the individual files.

    • Ensure the city column contains the correct city names.
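
The two checks above could be sketched as follows. The helper `verify_merge` and the `city` column name are assumptions for the example; the function raises a `ValueError` when either check fails and otherwise returns the per-city row counts.

```python
import pandas as pd

def verify_merge(merged, parts, expected_cities, city_col="city"):
    """Check row count and city labels of a merged DataFrame."""
    expected_rows = sum(len(p) for p in parts)
    if len(merged) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(merged)}")
    found = set(merged[city_col].unique())
    if found != set(expected_cities):
        raise ValueError(f"unexpected city labels: {sorted(map(str, found))}")
    return merged[city_col].value_counts().to_dict()

# Stand-ins for the individual city frames and their concatenation.
parts = [
    pd.DataFrame({"order_id": [1, 2], "city": ["city_a", "city_a"]}),
    pd.DataFrame({"order_id": [3], "city": ["city_b"]}),
]
merged = pd.concat(parts, ignore_index=True)
counts = verify_merge(merged, parts, ["city_a", "city_b"])
print(counts)
```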

7. Save the Combined Dataset

  • Save the merged dataset for subsequent preprocessing:
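
A sketch of the save step. The output name `delivery_merged.csv` is a placeholder, and a temporary directory is used only so the example leaves no files behind; in practice you would write to your project's data folder.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Stand-in for the merged DataFrame (hypothetical values).
merged = pd.DataFrame({"order_id": [1, 2, 3],
                       "city": ["city_a", "city_a", "city_b"]})

out_path = Path(tempfile.mkdtemp()) / "delivery_merged.csv"

# index=False keeps the row index from being written as an extra column.
merged.to_csv(out_path, index=False)

# Reading the file back is a quick sanity check on what was written.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```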


Post-Merge Validation

After merging:

  1. Check for Duplicates:

    • Ensure no duplicate entries exist in the merged dataset:
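
A minimal sketch of the duplicate check, treating a row as a duplicate only when every column matches an earlier row. Whether a subset of columns (such as an order identifier) should define a duplicate depends on the dataset and is not specified here.

```python
import pandas as pd

# Stand-in for the merged DataFrame, with one repeated row (hypothetical values).
merged = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "city": ["city_a", "city_a", "city_a", "city_b"],
})

# duplicated() marks every repeat after the first occurrence.
n_dupes = int(merged.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = merged.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```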

  2. Handle Missing Values:

    • Reassess missing values now that all data is combined.
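
One way to reassess missing values is to count them per column and then per city, which shows whether gaps are concentrated in one file. The column names below are placeholders.

```python
import pandas as pd

# Stand-in for the merged DataFrame, with one missing coordinate.
merged = pd.DataFrame({
    "order_id": [1, 2, 3],
    "lng": [120.1, None, 121.5],
    "city": ["city_a", "city_a", "city_b"],
})

# Missing values per column across the whole dataset.
missing_per_column = merged.isna().sum()

# Missing values per column, broken down by city.
missing_by_city = merged.drop(columns="city").isna().groupby(merged["city"]).sum()
print(missing_per_column)
print(missing_by_city)
```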

  3. Column Alignment:

    • Ensure all columns are correctly formatted and ready for preprocessing.
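
A sketch of what column formatting might involve, assuming (hypothetically) a numeric identifier stored as text, a timestamp column named `accept_time`, and the `city` label. The real column names and types should be taken from the dataset's documentation.

```python
import pandas as pd

# Stand-in for the merged DataFrame with everything read as text.
merged = pd.DataFrame({
    "order_id": ["1", "2", "3"],
    "accept_time": ["2022-06-01 08:00:00", "2022-06-01 09:30:00", "not a time"],
    "city": ["city_a", "city_a", "city_b"],
})

# errors="coerce" turns unparseable values into NaN / NaT instead of raising.
merged["order_id"] = pd.to_numeric(merged["order_id"], errors="coerce")
merged["accept_time"] = pd.to_datetime(merged["accept_time"], errors="coerce")

# A categorical dtype suits a column with a handful of repeated labels.
merged["city"] = merged["city"].astype("category")
print(merged.dtypes)
```

Coerced values show up as missing afterwards, so it is worth re-running the missing-value counts once the types are set.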


By following this approach, you'll have a unified dataset containing all five cities' data, ready for preprocessing and further analysis.
