Delivery Data Preprocessing

First Task: Data Preprocessing for Delivery Dataset

Preprocessing the dataset involves preparing it for analysis and modeling. Here’s a detailed breakdown of steps tailored to the columns in the dataset.


Step 1: Understand the Dataset

  • Columns to review:

    • Package Information: package_id

    • Location Data: lng, lat, region_id, city, aoi_id, aoi_type

    • Courier Data: courier_id

    • Task-Event Information: accept_time, accept_gps_time, accept_gps_lng, accept_gps_lat, delivery_time, delivery_gps_time, delivery_gps_lng, delivery_gps_lat

    • Context Information: ds
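
A quick first pass over these columns might look like the sketch below. It assumes the data is already loaded into a pandas DataFrame; the helper name `missing_columns` and the tiny sample frame are illustrative only and are not part of the dataset.

```python
import pandas as pd

# Columns listed in Step 1, grouped as in the text.
EXPECTED_COLUMNS = [
    "package_id",
    "lng", "lat", "region_id", "city", "aoi_id", "aoi_type",
    "courier_id",
    "accept_time", "accept_gps_time", "accept_gps_lng", "accept_gps_lat",
    "delivery_time", "delivery_gps_time", "delivery_gps_lng", "delivery_gps_lat",
    "ds",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny made-up frame standing in for the real file.
sample = pd.DataFrame({"package_id": [1, 2], "lng": [121.5, None], "lat": [31.2, 31.3]})

absent = missing_columns(sample)      # expected columns not found in the frame
null_counts = sample.isna().sum()     # number of missing values per column
print(sample.dtypes)                  # inferred data type of each column
```

On the real data, `absent` should come back empty, and `null_counts` gives a first look at which columns Step 2 needs to handle.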


Step 2: Data Cleaning

  1. Handle Missing Values:

    • Key Columns: Check for missing values in essential columns (accept_time, delivery_time, lng, lat, courier_id).

    • Action:

      • Drop rows with missing accept_time or delivery_time, since both are needed to compute the delivery duration that the target variable depends on.

      • Impute missing lng/lat values with the mean of the same region (region_id) or city.

  2. Remove Duplicates:

    • Use package_id as a unique identifier to ensure no duplicate entries exist.

    • Example:

      # Keep the first row for each package_id and drop later repeats
      df = df.drop_duplicates(subset='package_id', keep='first')

  3. Verify Data Types:

    • Ensure columns have the correct data type:

      • accept_time & delivery_time: Convert to datetime.

      • lng, lat: Ensure they are float.

      • aoi_type: Ensure it’s categorical.
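
The missing-value and data-type steps above could be combined roughly as follows. This is a minimal sketch on made-up rows, not the dataset's actual values. It assumes the timestamps are strings that `pd.to_datetime` can parse (pass an explicit `format=` if they are not), and it runs the type conversion first so that unparseable timestamps become missing values and are dropped together with the truly missing ones.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "package_id": [1, 2, 3, 4, 5],
    "city": ["A", "A", "A", "B", "B"],
    "lng": [121.0, None, 123.0, 116.0, 116.4],
    "lat": [31.0, None, 33.0, 40.0, 39.9],
    "aoi_type": ["1", "2", "1", "2", "1"],
    "accept_time": ["2023-05-01 08:00:00", "2023-05-01 09:00:00",
                    "2023-05-01 09:30:00", None, "2023-05-01 10:00:00"],
    "delivery_time": ["2023-05-01 08:30:00", "2023-05-01 09:45:00",
                      "2023-05-01 10:15:00", "2023-05-01 11:00:00", "not a time"],
})

# Convert types; values that cannot be parsed become NaT / NaN instead of raising.
for col in ["accept_time", "delivery_time"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")
for col in ["lng", "lat"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["aoi_type"] = df["aoi_type"].astype("category")

# Drop rows where either timestamp is missing.
df = df.dropna(subset=["accept_time", "delivery_time"])

# Fill missing coordinates with the mean of the same city.
for col in ["lng", "lat"]:
    city_mean = df.groupby("city")[col].transform("mean")
    df[col] = df[col].fillna(city_mean)
```

Grouping by `region_id` instead of `city` works the same way. A row whose whole group has no coordinates stays missing and needs a separate decision.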


Step 3: Outlier Detection

  1. Delivery Duration:

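    • Compute the delivery duration as delivery_time minus accept_time, then identify abnormally long or short values (including negative ones).

    • Use a z-score threshold (commonly |z| > 3) or the IQR rule (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR) to detect and remove these outliers.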

  2. Geospatial Anomalies:

    • Validate lng/lat values for any out-of-bound entries:

      • Latitude range: -90 to 90.

      • Longitude range: -180 to 180.
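
Both checks can be expressed as boolean masks. The sketch below uses the IQR rule with the conventional 1.5 multiplier on made-up rows; the `delivery_duration` column (in minutes) is derived here and is an assumption about how the duration is represented.

```python
import pandas as pd

# Made-up rows: the last one takes two days, the fifth has an impossible longitude.
df = pd.DataFrame({
    "accept_time": pd.to_datetime(["2023-05-01 08:00"] * 6),
    "delivery_time": pd.to_datetime([
        "2023-05-01 08:30", "2023-05-01 08:40", "2023-05-01 08:50",
        "2023-05-01 09:00", "2023-05-01 09:10", "2023-05-03 08:00",
    ]),
    "lng": [121.0, 121.1, 121.2, 121.3, 200.0, 121.5],
    "lat": [31.0, 31.1, 31.2, 31.3, 31.4, 31.5],
})

# Delivery duration in minutes.
df["delivery_duration"] = (
    (df["delivery_time"] - df["accept_time"]).dt.total_seconds() / 60
)

# IQR rule: keep durations within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1 = df["delivery_duration"].quantile(0.25)
q3 = df["delivery_duration"].quantile(0.75)
iqr = q3 - q1
duration_ok = df["delivery_duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Coordinate bounds: latitude in [-90, 90], longitude in [-180, 180].
geo_ok = df["lat"].between(-90, 90) & df["lng"].between(-180, 180)

cleaned = df[duration_ok & geo_ok]
```

Keeping the two masks separate makes it easy to count how many rows each rule removes before dropping anything.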


Step 4: Data Transformation

  1. Encode Categorical Variables:

    • Convert city and aoi_type to numeric using one-hot or label encoding.

    • Example:

      # One-hot encode aoi_type (pd is pandas); drop_first omits one redundant dummy column
      df = pd.get_dummies(df, columns=['aoi_type'], drop_first=True)

  2. Standardize Numerical Features:

    • Scale derived numeric features such as distance and delivery_duration (for example, with z-score standardization) so that all variables are on a similar scale.
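
As a rough illustration of the label-encoding and scaling options, the sketch below uses plain pandas on made-up values. The `distance` and `delivery_duration` columns are assumed to have been derived earlier (they are not raw dataset columns), and `city_code` is a hypothetical name for the encoded column.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "distance": [1.0, 2.0, 3.0, 4.0],
    "delivery_duration": [10.0, 20.0, 30.0, 40.0],
})

# Label encoding: map each city to an integer code (categories are sorted).
df["city_code"] = df["city"].astype("category").cat.codes

# Z-score standardization: subtract the column mean, divide by its standard deviation.
numeric_cols = ["distance", "delivery_duration"]
means = df[numeric_cols].mean()
stds = df[numeric_cols].std(ddof=0)
df[numeric_cols] = (df[numeric_cols] - means) / stds
```

Label codes imply an ordering between cities, so one-hot encoding is usually the safer choice for models that treat inputs as numeric magnitudes.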


Step 5: Save Cleaned Data

  • Save the preprocessed dataset for further modeling and analysis.

  • Example:

    df.to_csv('cleaned_delivery_data.csv', index=False)
