Pickup Data Preprocessing
Here is a detailed description of all the steps in the Python data preprocessing pipeline for the pickup dataset:
Step-by-Step Description
1. Import Libraries
Description: We begin by importing the necessary Python libraries:
pandas for data manipulation.
numpy for numerical computations.
datetime for handling date and time formats.
These libraries form the foundation for handling, transforming, and analyzing the data.
2. Load and Merge Datasets
Description: The dataset is spread across five CSV files for different cities. We:
Load each file into a pandas DataFrame.
Add a city column to track the origin of each record (if not already present).
Merge all files into a single DataFrame using pd.concat() to enable unified processing.
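The consolidation step can be sketched as follows. The two small in-memory frames below are hypothetical stand-ins for the per-city CSV files; in practice each would come from pd.read_csv():

```python
import pandas as pd

# Hypothetical stand-ins for the per-city CSV files; in practice each
# frame would come from pd.read_csv("cityN_pickup_data.csv").
city_frames = {
    "City1": pd.DataFrame({"package_id": [1, 2], "lng": [116.40, 116.45]}),
    "City2": pd.DataFrame({"package_id": [3], "lng": [121.47]}),
}

# Tag each frame with its origin city, then stack them into one DataFrame.
for city, frame in city_frames.items():
    frame["city"] = city

merged = pd.concat(city_frames.values(), ignore_index=True)
```

ignore_index=True rebuilds a clean 0..n-1 index so row positions from the separate files do not collide.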
3. Handle Missing and Inconsistent Values
Description: Missing or inconsistent data can distort analyses. Here’s how we address them:
Identify missing values using .isnull().sum() to assess data health.
Fill missing categorical values (aoi_type, city) with placeholders like "Unknown".
Replace missing numerical values such as lng and lat with mean values as approximations.
Drop rows with critical missing data in columns such as accept_time and pickup_time.
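A minimal sketch of these fill rules on a toy frame (column names are from the dataset, the values are invented):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing categorical value and one missing coordinate.
df = pd.DataFrame({
    "aoi_type": ["residential", None],
    "lng": [116.40, np.nan],
})

df["aoi_type"] = df["aoi_type"].fillna("Unknown")  # placeholder for categoricals
df["lng"] = df["lng"].fillna(df["lng"].mean())     # mean imputation for coordinates
```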
4. Remove Duplicate Records
Description: Duplicates can skew insights. Using the unique package_id, we:
Check for duplicate rows.
Retain only one record for each package_id using drop_duplicates().
This ensures data integrity.
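For instance, drop_duplicates() keeps the first record per package_id by default:

```python
import pandas as pd

# Two rows share a package_id; only the first occurrence is retained.
df = pd.DataFrame({
    "package_id": ["P1", "P1", "P2"],
    "city": ["City1", "City1", "City2"],
})
deduped = df.drop_duplicates(subset="package_id")
```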
5. Convert Time Columns to Datetime Format
Description: Time-based calculations require a consistent format. We:
Convert time columns (accept_time, pickup_time, etc.) to pandas' datetime format with pd.to_datetime().
Use errors='coerce' to handle invalid entries, setting them as NaT (Not a Time).
Investigate columns with invalid entries for corrective actions.
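The effect of errors='coerce' can be seen on a toy series containing one malformed entry:

```python
import pandas as pd

# One valid timestamp and one malformed string; coercion yields NaT
# instead of raising a parsing error.
raw = pd.Series(["2024-01-01 08:00:00", "not-a-timestamp"])
parsed = pd.to_datetime(raw, errors="coerce")
```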
6. Calculate Derived Features
Description: Derived features add analytical depth and predictive value:
Time Window Duration: Compute the delivery window length: time_window_duration = time_window_end - time_window_start
Task Duration: Calculate the time from task acceptance to pickup: task_duration = pickup_time - accept_time
Distance Between Acceptance and Pickup: Using the Euclidean formula: distance = sqrt((pickup_gps_lng - accept_gps_lng)^2 + (pickup_gps_lat - accept_gps_lat)^2)
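On one invented record, the three features come out as follows (note that the Euclidean distance on raw longitude/latitude is in degrees, a rough proxy rather than metres):

```python
import pandas as pd
import numpy as np

# One toy record with invented timestamps and GPS coordinates.
df = pd.DataFrame({
    "time_window_start": pd.to_datetime(["2024-01-01 08:00:00"]),
    "time_window_end":   pd.to_datetime(["2024-01-01 10:00:00"]),
    "accept_time":       pd.to_datetime(["2024-01-01 08:00:00"]),
    "pickup_time":       pd.to_datetime(["2024-01-01 08:30:00"]),
    "accept_gps_lng": [116.40], "accept_gps_lat": [39.90],
    "pickup_gps_lng": [116.43], "pickup_gps_lat": [39.94],
})

# Durations in seconds; distance in degrees via the Euclidean formula.
df["time_window_duration"] = (df["time_window_end"] - df["time_window_start"]).dt.total_seconds()
df["task_duration"] = (df["pickup_time"] - df["accept_time"]).dt.total_seconds()
df["distance"] = np.sqrt((df["pickup_gps_lng"] - df["accept_gps_lng"])**2 +
                         (df["pickup_gps_lat"] - df["accept_gps_lat"])**2)
```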
7. Identify and Filter Outliers
Description: Outliers can result from errors or unusual occurrences. To handle them:
Calculate the interquartile range (IQR) for task_duration.
Define boundaries:
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Filter out data points falling outside these bounds.
8. Save the Cleaned Dataset
Description: The final dataset is saved as cleaned_pickup_data.csv. This ensures all preprocessing efforts are preserved for subsequent steps.
Why Each Step is Important
Merging datasets consolidates fragmented data for efficient handling.
Handling missing values ensures no essential analysis is hindered by gaps.
Removing duplicates avoids inflating insights with repeated data.
Datetime conversion prepares the data for temporal analysis.
Derived features enhance the predictive and analytical power of the dataset.
Outlier removal improves data quality by addressing anomalies.
This systematic approach ensures a robust, clean, and insightful dataset ready for further analysis or machine learning applications.
Python-Based Preprocessing
Python Code for Data Preprocessing
1. Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime
2. Load and Merge Datasets
# Load datasets for all cities
city1 = pd.read_csv("city1_pickup_data.csv")
city2 = pd.read_csv("city2_pickup_data.csv")
city3 = pd.read_csv("city3_pickup_data.csv")
city4 = pd.read_csv("city4_pickup_data.csv")
city5 = pd.read_csv("city5_pickup_data.csv")
# Add a city column (if missing)
city1['city'] = 'City1'
city2['city'] = 'City2'
city3['city'] = 'City3'
city4['city'] = 'City4'
city5['city'] = 'City5'
# Merge all datasets
merged_data = pd.concat([city1, city2, city3, city4, city5], ignore_index=True)
3. Handle Missing and Inconsistent Values
# Check for missing values
print(merged_data.isnull().sum())
# Fill missing values for categorical columns
merged_data['aoi_type'] = merged_data['aoi_type'].fillna('Unknown')
merged_data['city'] = merged_data['city'].fillna('Unknown City')
# Replace missing coordinates with per-city (regional) averages
merged_data['lng'] = merged_data['lng'].fillna(merged_data.groupby('city')['lng'].transform('mean'))
merged_data['lat'] = merged_data['lat'].fillna(merged_data.groupby('city')['lat'].transform('mean'))
# Drop rows with critical missing timestamps
merged_data = merged_data.dropna(subset=['accept_time', 'pickup_time'])
4. Remove Duplicate Records
# Remove duplicates based on package_id
merged_data.drop_duplicates(subset='package_id', inplace=True)
5. Convert Time Columns to Datetime
# Convert time columns to datetime format
time_columns = ['accept_time', 'accept_gps_time', 'pickup_time', 'pickup_gps_time', 'time_window_start', 'time_window_end', 'ds']
for col in time_columns:
    merged_data[col] = pd.to_datetime(merged_data[col], errors='coerce')
# Check for invalid dates
print(merged_data[time_columns].isnull().sum())
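Rows coerced to NaT can be set aside for inspection before being dropped. A sketch on a toy frame standing in for merged_data:

```python
import pandas as pd

# Toy frame standing in for merged_data, with one malformed pickup_time.
merged = pd.DataFrame({
    "accept_time": ["2024-01-01 08:00:00", "2024-01-01 09:00:00"],
    "pickup_time": ["2024-01-01 08:30:00", "bad-entry"],
})
time_cols = ["accept_time", "pickup_time"]
for col in time_cols:
    merged[col] = pd.to_datetime(merged[col], errors="coerce")

# Keep the coerced rows aside for manual inspection, then drop them.
invalid_rows = merged[merged[time_cols].isna().any(axis=1)]
merged = merged.dropna(subset=time_cols)
```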
6. Calculate Derived Features
Time Window Duration:
merged_data['time_window_duration'] = (merged_data['time_window_end'] - merged_data['time_window_start']).dt.total_seconds()
Acceptance to Pickup Duration:
merged_data['task_duration'] = (merged_data['pickup_time'] - merged_data['accept_time']).dt.total_seconds()
Distance Between Acceptance and Pickup:
# Define a function for Euclidean distance
def calculate_distance(lng1, lat1, lng2, lat2):
    return np.sqrt((lng2 - lng1)**2 + (lat2 - lat1)**2)

merged_data['distance'] = calculate_distance(
    merged_data['accept_gps_lng'], merged_data['accept_gps_lat'],
    merged_data['pickup_gps_lng'], merged_data['pickup_gps_lat']
)
7. Identify and Filter Outliers
# Remove outliers based on task_duration (e.g., duration should be reasonable)
q1 = merged_data['task_duration'].quantile(0.25)
q3 = merged_data['task_duration'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
merged_data = merged_data[(merged_data['task_duration'] >= lower_bound) & (merged_data['task_duration'] <= upper_bound)]
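The same IQR rule can be wrapped in a small reusable helper. A sketch, where the 1.5 multiplier is the conventional default:

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Toy task durations in seconds; 99999 is an obvious outlier.
durations = pd.Series([300, 320, 310, 290, 305, 99999])
kept = durations[iqr_mask(durations)]
```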
8. Save the Cleaned Dataset
# Save the preprocessed dataset
merged_data.to_csv("cleaned_pickup_data.csv", index=False)
Summary of Steps
Load and merge all city datasets into one.
Handle missing values by replacing or removing them.
Remove duplicates to ensure data integrity.
Convert time columns to the datetime format.
Calculate derived features, including durations and distances.
Filter outliers to ensure quality.
Save the cleaned dataset for further analysis.
Excel and BI Data Preprocessing
Here’s a detailed explanation of how to preprocess the Pickup Dataset using Excel and Power BI, broken down step by step:
Steps for Preprocessing in Excel
Step 1: Consolidate Files
Open Excel.
Create a Master Workbook.
Import all five CSV files into separate sheets in this workbook:
Go to Data > Get Data > From Text/CSV.
Load each file into a separate tab.
Add a new City column in each sheet (if not present), specifying the city name corresponding to the file.
Use Copy-Paste or Power Query (Data > Get Data > Combine Queries) to consolidate the data from all sheets into a single worksheet.
Step 2: Handle Missing Values
Use Conditional Formatting (Home > Conditional Formatting > Highlight Cells Rules > Blanks) to identify blank cells in critical columns (accept_time, pickup_time).
Address missing values:
Numerical columns (e.g., lng, lat): fill with the column average using =AVERAGE(range).
Categorical columns (e.g., aoi_type): fill with "Unknown".
Remove rows where critical columns like accept_time or pickup_time are missing.
Step 3: Remove Duplicate Records
Select the data range.
Go to Data > Remove Duplicates and keep only the package_id column checked, so that a single record is retained per package_id (matching the Python step).
Step 4: Convert Time Columns
Ensure all time columns (accept_time, pickup_time) are formatted correctly:
Select the column.
Go to Home > Number > Custom and choose a date-time format like dd/mm/yyyy hh:mm:ss.
Investigate rows with incorrect or incomplete time data manually.
Step 5: Calculate Derived Features
Time Window Duration:
Add a new column Time Window Duration.
Use =time_window_end - time_window_start to calculate the duration.
Task Duration:
Add a new column Task Duration.
Use =pickup_time - accept_time.
Distance Between Acceptance and Pickup:
Add a new column Distance.
Use the formula: =SQRT((pickup_gps_lng - accept_gps_lng)^2 + (pickup_gps_lat - accept_gps_lat)^2)
Step 6: Identify and Filter Outliers
Calculate Q1 (25th percentile) and Q3 (75th percentile) for Task Duration using the PERCENTILE.INC function, e.g. =PERCENTILE.INC(range, 0.25).
Compute the IQR (interquartile range): IQR = Q3 - Q1
Identify outliers:
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Filter out rows where Task Duration falls outside these bounds.
Step 7: Save the Cleaned Dataset
Save the final preprocessed data as cleaned_pickup_data.xlsx.
Steps for Preprocessing in Power BI
Step 1: Import and Merge Data
Open Power BI Desktop.
Go to Home > Get Data > Text/CSV and load each city’s dataset.
Use the Power Query Editor:
Go to Transform Data.
Use Append Queries to combine all datasets into one.
Add a City column to distinguish the origin.
Step 2: Handle Missing Values
In Power Query, select Transform > Replace Values:
Replace null values in numerical columns with the column mean (Transform > Statistics > Average).
Replace null values in categorical columns with "Unknown".
Filter out rows with critical missing data in accept_time or pickup_time.
Step 3: Remove Duplicates
Select the package_id column in Power Query.
Use Remove Duplicates from the toolbar.
Step 4: Transform Time Columns
Convert columns like accept_time and pickup_time to date/time format using Transform > Data Type > Date/Time.
Create custom columns to calculate derived metrics (e.g., task duration, time window duration).
Step 5: Add Derived Features
Time Window Duration:
Use Add Column > Custom Column with: [time_window_end] - [time_window_start]
Task Duration:
Add a custom column: [pickup_time] - [accept_time]
Distance Between Acceptance and Pickup:
Use Power Query M's Number.Sqrt and Number.Power (M has no bare sqrt or pow functions): Number.Sqrt(Number.Power([pickup_gps_lng] - [accept_gps_lng], 2) + Number.Power([pickup_gps_lat] - [accept_gps_lat], 2))
Step 6: Identify Outliers
Create measures for Q1, Q3, and the IQR using DAX (e.g., the PERCENTILE.INC function).
Use DAX filters to exclude rows outside the lower and upper bounds.
Step 7: Export Cleaned Dataset
Once preprocessing is complete, export the dataset from Power BI:
Go to
Home > Export > Export Data
.