Delivery Data Preprocessing
First Task: Data Preprocessing for Delivery Dataset
Preprocessing the dataset involves preparing it for analysis and modeling. Here’s a detailed breakdown of steps tailored to the columns in the dataset.
Step 1: Understand the Dataset
Columns to review:
Package Information: package_id
Location Data: lng, lat, region_id, city, aoi_id, aoi_type
Courier Data: courier_id
Task-Event Information: accept_time, accept_gps_time, accept_gps_lng, accept_gps_lat, delivery_time, delivery_gps_time, delivery_gps_lng, delivery_gps_lat
Context Information: ds
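Before cleaning, it can help to confirm that the loaded table actually contains the columns listed above. The following is a minimal sketch: the helper name `missing_columns` and the tiny sample frame are illustrative assumptions, not part of the dataset or its tooling.

```python
import pandas as pd

# Columns listed in Step 1, grouped in the same order as the text.
EXPECTED_COLUMNS = [
    "package_id",
    "lng", "lat", "region_id", "city", "aoi_id", "aoi_type",
    "courier_id",
    "accept_time", "accept_gps_time", "accept_gps_lng", "accept_gps_lat",
    "delivery_time", "delivery_gps_time", "delivery_gps_lng", "delivery_gps_lat",
    "ds",
]

def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Illustrative frame that carries only three of the expected columns.
sample = pd.DataFrame({"package_id": [1], "lng": [121.4], "lat": [31.2]})
absent = missing_columns(sample)
print(absent)
```

An empty list means every expected column is present, and the later steps can rely on those names.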
Step 2: Data Cleaning
Handle Missing Values:
Key Columns: Check for missing values in essential columns (accept_time, delivery_time, lng, lat, courier_id).
Action:
Drop rows with missing accept_time or delivery_time, as these are crucial for the target variable.
Impute missing lng/lat values with the region or city mean.
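The two actions above could be sketched as follows. This is one possible reading, not a definitive implementation: it assumes the region mean is tried first with the city mean as a fallback, and the function name `handle_missing` and the sample values are made up for illustration.

```python
import pandas as pd

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without timestamps, then fill missing coordinates with a group mean."""
    # Rows lacking either timestamp cannot contribute to the target variable.
    df = df.dropna(subset=["accept_time", "delivery_time"]).copy()
    for col in ["lng", "lat"]:
        # Assumption: prefer the mean of the same region, then fall back to the city mean.
        df[col] = df[col].fillna(df.groupby("region_id")[col].transform("mean"))
        df[col] = df[col].fillna(df.groupby("city")[col].transform("mean"))
    return df

# Illustrative data: row 1 has no accept_time, row 2 has no coordinates.
sample = pd.DataFrame({
    "accept_time": ["2023-05-01 08:00:00", None, "2023-05-01 09:00:00"],
    "delivery_time": ["2023-05-01 08:30:00", "2023-05-01 09:10:00", "2023-05-01 09:45:00"],
    "lng": [121.40, 121.50, None],
    "lat": [31.20, 31.30, None],
    "region_id": [1, 1, 1],
    "city": ["city_a", "city_a", "city_a"],
})
cleaned = handle_missing(sample)
```

Note that the group means are computed after the timestamp rows are dropped, so discarded rows do not influence the imputed coordinates.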
Remove Duplicates:
Use package_id as a unique identifier to ensure no duplicate entries exist.
Example:
df = df.drop_duplicates(subset='package_id', keep='first')
Verify Data Types:
Ensure columns have the correct data type:
accept_time & delivery_time: Convert to datetime.
lng, lat: Ensure they are float.
aoi_type: Ensure it’s categorical.
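A minimal sketch of these conversions with pandas is shown here. The function name `fix_dtypes` and the sample values are assumptions, and using `errors="coerce"` (which turns unparseable entries into missing values instead of raising) is one choice among several.

```python
import pandas as pd

def fix_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Convert timestamp, coordinate, and AOI-type columns to the expected dtypes."""
    df = df.copy()
    for col in ["accept_time", "delivery_time"]:
        # Unparseable timestamps become NaT rather than raising an error.
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in ["lng", "lat"]:
        # Non-numeric coordinates become NaN; the result is float64.
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("float64")
    df["aoi_type"] = df["aoi_type"].astype("category")
    return df

# Illustrative data with everything stored as strings or plain integers.
sample = pd.DataFrame({
    "accept_time": ["2023-05-01 08:00:00", "not a time"],
    "delivery_time": ["2023-05-01 08:30:00", "2023-05-01 09:10:00"],
    "lng": ["121.40", "121.50"],
    "lat": ["31.20", "31.30"],
    "aoi_type": [1, 14],
})
typed = fix_dtypes(sample)
print(typed.dtypes)
```

Because coercion can introduce new missing values, it may be worth running the conversion before the missing-value check rather than after it.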
Step 3: Outlier Detection
Delivery Duration:
Identify abnormally long or short delivery durations.
Use z-score or IQR to detect and remove these outliers.
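As a sketch of the IQR variant, the snippet below derives a delivery_duration column in minutes and keeps only rows inside the usual 1.5 × IQR fences. The unit (minutes), the multiplier `k=1.5`, the function name, and the sample values are all assumptions. It also assumes accept_time and delivery_time are already datetime columns.

```python
import pandas as pd

def remove_duration_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose delivery duration lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    df = df.copy()
    # Duration in minutes between acceptance and delivery.
    df["delivery_duration"] = (
        (df["delivery_time"] - df["accept_time"]).dt.total_seconds() / 60
    )
    q1 = df["delivery_duration"].quantile(0.25)
    q3 = df["delivery_duration"].quantile(0.75)
    iqr = q3 - q1
    inside = df["delivery_duration"].between(q1 - k * iqr, q3 + k * iqr)
    return df[inside]

# Illustrative data: five ordinary deliveries and one that took 2000 minutes.
minutes = [30, 35, 40, 45, 50, 2000]
sample = pd.DataFrame({"accept_time": pd.to_datetime(["2023-05-01 08:00:00"] * 6)})
sample["delivery_time"] = sample["accept_time"] + pd.to_timedelta(minutes, unit="min")
filtered = remove_duration_outliers(sample)
```

IQR fences are less sensitive to extreme values than a z-score, since a single very long delivery inflates the mean and standard deviation that the z-score depends on.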
Geospatial Anomalies:
Validate lng/lat values for any out-of-bound entries:
Latitude range: -90 to 90.
Longitude range: -180 to 180.
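The range check can be expressed as a boolean mask, as in this sketch. The helper name `valid_coordinates` and the sample values are illustrative. The same helper could be pointed at the accept_gps_* and delivery_gps_* columns by passing different column names.

```python
import pandas as pd

def valid_coordinates(df: pd.DataFrame, lng_col: str = "lng", lat_col: str = "lat") -> pd.Series:
    """Return True for rows whose latitude and longitude are both within bounds."""
    # Missing coordinates compare as False, so they are flagged as invalid too.
    return df[lat_col].between(-90, 90) & df[lng_col].between(-180, 180)

# Illustrative data: the second row has a bad longitude, the third a bad latitude.
sample = pd.DataFrame({
    "lng": [121.4, 200.0, -75.0],
    "lat": [31.2, 31.3, 95.0],
})
mask = valid_coordinates(sample)
in_bounds = sample[mask]
```

These global bounds only catch impossible values. A tighter bounding box around the cities in the data would catch more subtle errors, but the box itself would have to be chosen per dataset.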
Step 4: Data Transformation
Encode Categorical Variables:
Convert city and aoi_type to numeric using one-hot or label encoding.
Example:
df = pd.get_dummies(df, columns=['aoi_type'], drop_first=True)
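For the label-encoding alternative, one lightweight option is pandas' own category codes, sketched here for the city column. The city names and the `city_code` column name are placeholders.

```python
import pandas as pd

# Illustrative data with placeholder city names.
sample = pd.DataFrame({"city": ["city_b", "city_a", "city_b", "city_c"]})

# Converting to a categorical sorts the unique values, so codes follow
# the alphabetical order of the city names: city_a -> 0, city_b -> 1, city_c -> 2.
sample["city_code"] = sample["city"].astype("category").cat.codes
print(sample)
```

Label codes impose an arbitrary ordering on the cities. That is usually harmless for tree-based models, while one-hot encoding tends to be the safer choice for linear models.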
Standardize Numerical Features:
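Normalize features like distance and delivery_duration to ensure all variables are on a similar scale. Neither appears among the raw columns listed in Step 1, so both need to be derived before scaling.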
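A minimal z-score sketch in plain pandas is given below. It assumes the distance and delivery_duration columns have already been derived. The function name `standardize` and the sample values are made up for illustration.

```python
import pandas as pd

def standardize(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Rescale each listed column to zero mean and unit (population) standard deviation."""
    df = df.copy()
    for col in columns:
        std = df[col].std(ddof=0)
        if std > 0:
            df[col] = (df[col] - df[col].mean()) / std
        else:
            # A constant column carries no spread; set it to zero instead of dividing by zero.
            df[col] = 0.0
    return df

# Illustrative derived features on very different scales.
sample = pd.DataFrame({
    "distance": [0.5, 1.2, 3.4, 2.1],
    "delivery_duration": [30.0, 45.0, 120.0, 60.0],
})
scaled = standardize(sample, ["distance", "delivery_duration"])
```

When the data is later split into training and test sets, the mean and standard deviation should come from the training portion only, so that test information does not leak into the features.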
Step 5: Save Cleaned Data
Save the preprocessed dataset for further modeling and analysis.
Example:
df.to_csv('cleaned_delivery_data.csv', index=False)