Advanced Data Preparation

The tasks below cover merging the Delivery and Pickup datasets, integrating the Roads and Trajectory datasets, preprocessing, exploratory data analysis (EDA), and feature engineering, as discussed in the meeting:


Task 1: Merge Delivery and Pickup Datasets

  1. Load the Data:

    • Load the Pickup Dataset and Delivery Dataset into your Python environment (e.g., using Pandas).

    • Inspect the datasets for missing values, duplicate entries, and data types.

  2. Merge Datasets:

    • Perform an inner join on package_id to link the Pickup and Delivery data.

    • Retain the following columns:

      • From Pickup Dataset: package_id, pickup_time, pickup_gps_lng, pickup_gps_lat, city, region_id, aoi_id, aoi_type, courier_id.

      • From Delivery Dataset: delivery_time, delivery_gps_lng, delivery_gps_lat.

  3. Calculate Target Variable (ETA):

    • Create a new column, ETA, as: ETA = delivery_time - pickup_time

    • Ensure ETA is in a consistent unit (e.g., minutes or seconds).

  4. Handle Missing Values:

    • Identify and address missing pickup_time, delivery_time, or coordinate values.

    • Drop or impute the affected rows, depending on how many there are and how removing them would affect the dataset.

  5. Save Merged Dataset:

    • Save the merged dataset for future use (e.g., as a CSV file).
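The steps in Task 1 can be sketched with Pandas as follows. This is a minimal sketch, not a definitive implementation: the toy rows, the helper name merge_and_compute_eta, and the choice of minutes as the ETA unit are assumptions made for illustration, and only a subset of the retained columns is shown.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real Pickup and Delivery files.
pickup = pd.DataFrame({
    "package_id": [1, 2, 3],
    "pickup_time": ["2023-05-01 09:00:00", "2023-05-01 09:30:00", None],
    "pickup_gps_lng": [121.47, 121.48, 121.49],
    "pickup_gps_lat": [31.23, 31.24, 31.25],
    "courier_id": [10, 10, 11],
})
delivery = pd.DataFrame({
    "package_id": [1, 2, 3, 4],
    "delivery_time": ["2023-05-01 09:45:00", "2023-05-01 11:00:00",
                      "2023-05-01 11:30:00", "2023-05-01 12:00:00"],
    "delivery_gps_lng": [121.50, 121.52, 121.53, 121.55],
    "delivery_gps_lat": [31.20, 31.22, 31.24, 31.26],
})


def merge_and_compute_eta(pickup, delivery):
    """Inner-join on package_id and add ETA in minutes (unit is an assumption)."""
    # validate raises if package_id is duplicated on either side.
    merged = pickup.merge(delivery, on="package_id", how="inner",
                          validate="one_to_one")
    # Unparseable or missing timestamps become NaT instead of raising.
    for col in ["pickup_time", "delivery_time"]:
        merged[col] = pd.to_datetime(merged[col], errors="coerce")
    # Here rows without both timestamps are dropped; imputing is the alternative.
    merged = merged.dropna(subset=["pickup_time", "delivery_time"])
    delta = merged["delivery_time"] - merged["pickup_time"]
    merged["ETA"] = delta.dt.total_seconds() / 60.0
    return merged.reset_index(drop=True)


merged = merge_and_compute_eta(pickup, delivery)
print(merged[["package_id", "ETA"]])
```

Package 4 has no pickup record and package 3 has no pickup_time, so only packages 1 and 2 survive. The result can then be written out with merged.to_csv(...), using whatever file name the project prefers.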


Task 2: Integrate Roads Dataset

  1. Match Coordinates with Roads:

    • Use a nearest-neighbor spatial join (e.g., with GeoPandas) to map each pickup point (pickup_gps_lng, pickup_gps_lat) and each delivery point (delivery_gps_lng, delivery_gps_lat) to its nearest road segment in the Roads dataset.

    • Retain relevant road features for pickup and delivery points:

      • maxspeed, fclass, oneway, bridge, tunnel.

  2. Calculate Road-Based Distance:

    • Use the geometry column from the Roads dataset to calculate the distance along the road network between pickup and delivery points, as an alternative to the great-circle (Haversine) distance.

  3. Add Road-Specific Features:

    • Extract additional road features for the courier’s route, such as:

      • Average speed limit (maxspeed).

      • Proportion of tunnels, bridges, or one-way roads.
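The route-level road features in step 3 could be computed along these lines. This sketch assumes a hypothetical table, route_segments, that already lists the road segments on each package's route with a length_m column and boolean tunnel, bridge, and oneway flags; how the real Roads dataset encodes those flags is not specified here. Weighting by segment length is a design choice, not something the plan prescribes.

```python
import pandas as pd

# Hypothetical input: one row per (package, road segment on its route).
route_segments = pd.DataFrame({
    "package_id": [1, 1, 1, 2, 2],
    "length_m":   [200.0, 300.0, 500.0, 400.0, 600.0],
    "maxspeed":   [30, 60, 60, 50, 50],
    "tunnel":     [False, False, True, False, False],
    "bridge":     [False, True, False, False, False],
    "oneway":     [True, True, False, False, True],
})


def route_road_features(segments):
    """Length-weighted road features per package."""
    seg = segments.copy()
    flags = ["tunnel", "bridge", "oneway"]
    # Multiply each attribute by segment length so long segments count more.
    seg["speed_x_len"] = seg["maxspeed"] * seg["length_m"]
    for flag in flags:
        seg[flag + "_len"] = seg[flag].astype(float) * seg["length_m"]
    cols = ["length_m", "speed_x_len"] + [f + "_len" for f in flags]
    sums = seg.groupby("package_id")[cols].sum()
    out = pd.DataFrame({
        "road_distance_m": sums["length_m"],
        "avg_maxspeed": sums["speed_x_len"] / sums["length_m"],
    })
    for flag in flags:
        out["prop_" + flag] = sums[flag + "_len"] / sums["length_m"]
    return out.reset_index()


road_features = route_road_features(route_segments)
print(road_features)
```

An unweighted mean over segments would be simpler, but it would let many short segments dominate the average speed limit.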


Task 3: Integrate Trajectory Dataset

  1. Filter Trajectory Data:

    • Filter the Trajectory Dataset by courier_id and ds to match the courier and date of delivery.

  2. Identify Relevant Trajectory Points:

    • Extract trajectory points between pickup_time and delivery_time.

    • Calculate the courier’s total distance traveled during this time using consecutive trajectory points (e.g., Haversine formula).

  3. Calculate Courier-Based Features:

    • Derive trajectory-based features:

      • Average Speed: Total distance / total time.

      • Speed Variance: Variance in speed across trajectory points.

      • Number of Stops: Points where speed drops below a threshold.
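The trajectory features above can be sketched in plain Python. The input layout (a list of (unix_timestamp, lat, lng) tuples for one courier and day), the function names, and the 0.5 m/s stop threshold are assumptions for illustration. A stop is counted here once per run of consecutive slow segments, which is one possible reading of "speed drops below a threshold".

```python
import math
from statistics import pvariance

EARTH_RADIUS_M = 6371000.0  # mean Earth radius


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trajectory_features(points, pickup_ts, delivery_ts, stop_speed_mps=0.5):
    """Features from trajectory points between pickup and delivery (inclusive)."""
    window = sorted(p for p in points if pickup_ts <= p[0] <= delivery_ts)
    total_dist = 0.0
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(window, window[1:]):
        dt = t1 - t0
        if dt <= 0:  # skip duplicate timestamps
            continue
        dist = haversine_m(la0, lo0, la1, lo1)
        total_dist += dist
        speeds.append(dist / dt)
    if not speeds:
        return None  # fewer than two usable points in the window
    # Count a stop each time speed falls from above to below the threshold.
    stops, was_slow = 0, False
    for s in speeds:
        is_slow = s < stop_speed_mps
        if is_slow and not was_slow:
            stops += 1
        was_slow = is_slow
    total_time = window[-1][0] - window[0][0]
    return {
        "distance_m": total_dist,
        "avg_speed_mps": total_dist / total_time,
        "speed_variance": pvariance(speeds),
        "num_stops": stops,
    }


# Hypothetical points: move, pause for 10 s, move again; the last point is
# outside the pickup/delivery window and is ignored.
points = [(0, 0.0, 0.0), (10, 0.0, 0.001), (20, 0.0, 0.001),
          (30, 0.0, 0.002), (100, 0.0, 0.01)]
features = trajectory_features(points, pickup_ts=0, delivery_ts=30)
print(features)
```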


Task 4: Preprocess Data

  1. Handle Missing Values:

    • Impute missing values in the road features, trajectory features, or pickup/delivery coordinates.

  2. Normalize Continuous Features:

    • Scale features like distance, avg_speed, and maxspeed using Min-Max Scaling or Standard Scaling.

  3. Encode Categorical Variables:

    • Encode variables like fclass, aoi_type, and city using one-hot or label encoding.

  4. Create New Features:

    • Time-Based Features:

      • Day of the week (weekday/weekend).

      • Hour of the day (morning, afternoon, evening).

    • Location-Based Features:

      • Distance between pickup and delivery (Haversine and road-based).

      • City-level traffic congestion metrics (if available).

    • Courier Behavior Features:

      • Historical average delivery time per courier.

      • Courier workload (number of deliveries handled per day).
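A few of the preprocessing steps in Task 4 are sketched below with Pandas only. The toy data, the helper name preprocess, and the hour boundaries for morning, afternoon, and evening are assumptions; Min-Max scaling is written out by hand here, and a library scaler would work equally well.

```python
import pandas as pd

# Hypothetical merged rows; real data would come from Tasks 1 to 3.
df = pd.DataFrame({
    "package_id": [1, 2, 3],
    "courier_id": [10, 10, 11],
    "pickup_time": pd.to_datetime(["2023-05-06 08:15:00",
                                   "2023-05-06 13:40:00",
                                   "2023-05-08 19:05:00"]),
    "distance": [1.0, 3.0, 5.0],
    "fclass": ["primary", "residential", "primary"],
})


def preprocess(df):
    out = df.copy()
    # Time-based features (Monday is 0, so 5 and 6 are the weekend).
    out["day_of_week"] = out["pickup_time"].dt.dayofweek
    out["is_weekend"] = out["day_of_week"] >= 5
    out["hour"] = out["pickup_time"].dt.hour
    # Assumed bins: [0, 12) morning, [12, 17) afternoon, [17, 24) evening.
    out["part_of_day"] = pd.cut(out["hour"], bins=[0, 12, 17, 24], right=False,
                                labels=["morning", "afternoon", "evening"])
    # Courier workload: deliveries handled by the same courier on the same day.
    out["pickup_date"] = out["pickup_time"].dt.date
    out["deliveries_per_day"] = (
        out.groupby(["courier_id", "pickup_date"])["package_id"]
        .transform("count")
    )
    # Min-Max scaling to the range [0, 1].
    dmin, dmax = out["distance"].min(), out["distance"].max()
    out["distance_scaled"] = (out["distance"] - dmin) / (dmax - dmin)
    # One-hot encode the road class.
    return pd.get_dummies(out, columns=["fclass"])


prepared = preprocess(df)
print(prepared.head())
```

When the data is later split for modeling, the scaling bounds would normally be taken from the training portion only and reused on the rest.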


Task 5: Perform EDA

  1. Analyze Target Variable (ETA):

    • Plot the distribution of ETA to check for outliers and skewness.

    • Explore how ETA varies with:

      • Distance between pickup and delivery points.

      • Time of day (rush hour vs. non-rush hour).

      • Road features (maxspeed, fclass).

  2. Explore Relationships Between Features:

    • Scatter plots:

      • Distance vs. ETA.

      • Avg speed vs. ETA.

    • Box plots:

      • ETA across different fclass categories.

    • Heatmap:

      • Correlation matrix of numerical features.

  3. Investigate Outliers:

    • Identify extreme values for ETA, distance, or avg_speed.

    • Determine if these outliers represent genuine data points or errors.

  4. Check Data Distributions:

    • Histograms for numerical features (e.g., distance, maxspeed).

    • Count plots for categorical features (e.g., fclass, city).
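The numeric side of the EDA in Task 5 could start from something like the following. The 1.5 x IQR rule used to flag outliers is a common convention chosen here for illustration, not a method the plan specifies, and the data is a hypothetical toy sample with one implausibly long delivery.

```python
import pandas as pd

# Hypothetical sample: ETA in minutes, distance in km.
df = pd.DataFrame({
    "ETA": [30.0, 35.0, 40.0, 45.0, 50.0, 600.0],
    "distance": [1.0, 1.5, 2.0, 2.5, 3.0, 2.0],
    "fclass": ["primary", "primary", "residential",
               "residential", "primary", "residential"],
})


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


eta_skew = df["ETA"].skew()                 # > 0 means a long right tail
is_outlier = iqr_outliers(df["ETA"])        # candidates to inspect by hand
corr_all = df[["ETA", "distance"]].corr()   # correlation matrix, all rows
corr_clean = df.loc[~is_outlier, ["ETA", "distance"]].corr()
eta_by_fclass = df.groupby("fclass")["ETA"].median()

print(df[is_outlier])
print(corr_all.loc["ETA", "distance"], corr_clean.loc["ETA", "distance"])
```

Flagged rows are only candidates: each still needs a check against the raw records to decide whether it is a genuine slow delivery or a data error. The correlation matrix can be passed to a plotting library for the heatmap, and the same columns feed the histograms and box plots listed above.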


Task 6: Perform Feature Engineering

  1. Combine Features:

    • Aggregate features across datasets into a unified dataset:

      • Road-based: maxspeed, fclass, road distance.

      • Trajectory-based: avg_speed, distance traveled, number of stops.

      • Courier-based: historical performance metrics.

      • Spatial: pickup/delivery coordinates, city, region.

      • Temporal: day of the week, hour of the day.

  2. Interaction Features:

    • Create features capturing interactions between variables, such as:

      • Distance × Avg Speed.

      • Maxspeed × Road Type.

  3. Group-Level Aggregations:

    • Compute group-level statistics:

      • Average delivery time per courier (courier_id).

      • Average delivery time per city (city).
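The interaction and group-level features in Task 6 can be sketched as below. The column names follow the ones used in this plan, while the toy values and the helper name add_engineered_features are hypothetical. Since fclass is categorical, "Maxspeed × Road Type" is represented here as one maxspeed column per road class, which is one possible interpretation.

```python
import pandas as pd

# Hypothetical unified dataset after Tasks 1 to 4.
df = pd.DataFrame({
    "courier_id": [10, 10, 11, 11],
    "city": ["A", "A", "A", "B"],
    "distance": [1.0, 3.0, 2.0, 4.0],
    "avg_speed": [2.0, 4.0, 5.0, 2.5],
    "maxspeed": [30, 60, 60, 30],
    "fclass": ["residential", "primary", "primary", "residential"],
    "ETA": [20.0, 40.0, 30.0, 50.0],
})


def add_engineered_features(df):
    out = df.copy()
    # Interaction of two numeric features.
    out["distance_x_avg_speed"] = out["distance"] * out["avg_speed"]
    # Numeric x categorical: maxspeed where the road class matches, else 0.
    road_dummies = pd.get_dummies(out["fclass"], prefix="maxspeed_x").astype(float)
    out = out.join(road_dummies.mul(out["maxspeed"], axis=0))
    # Group-level averages broadcast back to every row of the group.
    out["courier_mean_eta"] = out.groupby("courier_id")["ETA"].transform("mean")
    out["city_mean_eta"] = out.groupby("city")["ETA"].transform("mean")
    return out


features = add_engineered_features(df)
print(features)
```

Because the group averages are derived from the target, they are best computed from training rows only once the data is split for modeling.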


Final Steps:

  1. Save Preprocessed Data:

    • Save the fully preprocessed and merged dataset for modeling.
