Pickup Data Preprocessing

Here is a detailed description of all the steps in the Python data preprocessing pipeline for the pickup dataset:


Step-by-Step Description

1. Import Libraries

Description: We begin by importing necessary Python libraries:

  • pandas for data manipulation.

  • numpy for numerical computations.

  • datetime for handling date and time formats.

These libraries form the foundation for handling, transforming, and analyzing data.


2. Load and Merge Datasets

Description: The dataset is spread across five CSV files for different cities. We:

  1. Load each file into a pandas DataFrame.

  2. Add a city column to track the origin of each record (if not already present).

  3. Merge all files into a single DataFrame using pd.concat() to enable unified processing.


3. Handle Missing and Inconsistent Values

Description: Missing or inconsistent data can distort analyses. Here’s how we address them:

  1. Identify missing values using .isnull().sum() to assess data health.

  2. Fill categorical data (aoi_type, city) with placeholders like "Unknown".

  3. Replace missing numerical values like lng and lat with mean values as approximations.

  4. Drop rows with critical missing data in columns such as accept_time and pickup_time.


4. Remove Duplicate Records

Description: Duplicates can skew insights. Using the unique package_id, we:

  1. Check for duplicate rows.

  2. Retain only one record for each package_id using drop_duplicates().

This ensures data integrity.


5. Convert Time Columns to Datetime Format

Description: Time-based calculations require a consistent format. We:

  1. Convert time columns (accept_time, pickup_time, etc.) to Python’s datetime format.

  2. Use errors='coerce' to handle invalid entries, setting them as NaT (Not a Time).

  3. Investigate columns with invalid entries for corrective actions.


6. Calculate Derived Features

Description: Derived features add analytical depth and predictive value:

  1. Time Window Duration: Compute the delivery window length: $\text{time\_window\_duration} = \text{time\_window\_end} - \text{time\_window\_start}$

  2. Task Duration: Calculate the time from task acceptance to pickup: $\text{task\_duration} = \text{pickup\_time} - \text{accept\_time}$

  3. Distance Between Acceptance and Pickup: Using the Euclidean formula: $\text{distance} = \sqrt{(\text{pickup\_gps\_lng} - \text{accept\_gps\_lng})^2 + (\text{pickup\_gps\_lat} - \text{accept\_gps\_lat})^2}$


7. Identify and Filter Outliers

Description: Outliers can result from errors or unusual occurrences. To handle them:

  1. Calculate the interquartile range (IQR) for task_duration.

  2. Define boundaries:

    • Lower Bound = $Q1 - 1.5 \times \text{IQR}$

    • Upper Bound = $Q3 + 1.5 \times \text{IQR}$

  3. Filter out data points falling outside these bounds.


8. Save the Cleaned Dataset

Description: The final dataset is saved as cleaned_pickup_data.csv. This ensures all preprocessing efforts are preserved for subsequent steps.


Why Each Step is Important

  1. Merging datasets consolidates fragmented data for efficient handling.

  2. Handling missing values ensures no essential analysis is hindered by gaps.

  3. Removing duplicates avoids inflating insights with repeated data.

  4. Datetime conversion prepares the data for temporal analysis.

  5. Derived features enhance the predictive and analytical power of the dataset.

  6. Outlier removal improves data quality by addressing anomalies.

This systematic approach ensures a robust, clean, and insightful dataset ready for further analysis or machine learning applications.

Python-Based Preprocessing


Python Code for Data Preprocessing

1. Import Libraries

import pandas as pd
import numpy as np
from datetime import datetime

2. Load and Merge Datasets

# Load datasets for all cities
city1 = pd.read_csv("city1_pickup_data.csv")
city2 = pd.read_csv("city2_pickup_data.csv")
city3 = pd.read_csv("city3_pickup_data.csv")
city4 = pd.read_csv("city4_pickup_data.csv")
city5 = pd.read_csv("city5_pickup_data.csv")

# Add a city column (if missing)
city1['city'] = 'City1'
city2['city'] = 'City2'
city3['city'] = 'City3'
city4['city'] = 'City4'
city5['city'] = 'City5'

# Merge all datasets
merged_data = pd.concat([city1, city2, city3, city4, city5], ignore_index=True)
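Since the five files share a common naming pattern, the per-city loads can also be collapsed into a loop. A minimal sketch, assuming the city{i}_pickup_data.csv names used above:

# Load and tag each city's file, then concatenate in one pass
frames = []
for i in range(1, 6):
    df = pd.read_csv(f"city{i}_pickup_data.csv")
    df['city'] = f'City{i}'  # track the origin of each record
    frames.append(df)
merged_data = pd.concat(frames, ignore_index=True)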

3. Handle Missing and Inconsistent Values

# Check for missing values
print(merged_data.isnull().sum())

# Fill missing values for categorical columns
merged_data['aoi_type'] = merged_data['aoi_type'].fillna('Unknown')
merged_data['city'] = merged_data['city'].fillna('Unknown City')

# Replace missing coordinates with the column means as approximations
merged_data['lng'] = merged_data['lng'].fillna(merged_data['lng'].mean())
merged_data['lat'] = merged_data['lat'].fillna(merged_data['lat'].mean())

# Drop rows with critical missing timestamps
merged_data.dropna(subset=['accept_time', 'pickup_time'], inplace=True)

4. Remove Duplicate Records

# Remove duplicates based on package_id
merged_data.drop_duplicates(subset='package_id', inplace=True)
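The step description also calls for checking for duplicates first; a quick count, run before the drop above, confirms how many rows are affected:

# Count rows whose package_id already appeared earlier in the DataFrame
print(merged_data['package_id'].duplicated().sum(), "duplicate package_id rows")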

5. Convert Time Columns to Datetime

# Convert time columns to datetime format
time_columns = ['accept_time', 'accept_gps_time', 'pickup_time', 'pickup_gps_time', 'time_window_start', 'time_window_end', 'ds']

for col in time_columns:
    merged_data[col] = pd.to_datetime(merged_data[col], errors='coerce')

# Check for invalid dates
print(merged_data[time_columns].isnull().sum())
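If any of these counts are non-zero, the coerced rows can be pulled out for inspection before deciding on corrective action, for example:

# Inspect rows whose accept_gps_time could not be parsed and became NaT
invalid_rows = merged_data[merged_data['accept_gps_time'].isna()]
print(invalid_rows.head())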

6. Calculate Derived Features

  1. Time Window Duration:

    merged_data['time_window_duration'] = (merged_data['time_window_end'] - merged_data['time_window_start']).dt.total_seconds()

  2. Acceptance to Pickup Duration:

    merged_data['task_duration'] = (merged_data['pickup_time'] - merged_data['accept_time']).dt.total_seconds()

  3. Distance Between Acceptance and Pickup:

    # Define a function for Euclidean distance
    def calculate_distance(lng1, lat1, lng2, lat2):
        return np.sqrt((lng2 - lng1)**2 + (lat2 - lat1)**2)
    
    merged_data['distance'] = calculate_distance(
        merged_data['accept_gps_lng'], merged_data['accept_gps_lat'],
        merged_data['pickup_gps_lng'], merged_data['pickup_gps_lat']
    )
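Note that the Euclidean formula above operates on raw degrees of longitude and latitude, so the result is only a relative measure. If physical distances are needed, a haversine great-circle calculation is a common substitute; a minimal sketch (the haversine_km helper and distance_km column are illustrative additions, not part of the original pipeline):

# Great-circle distance in kilometres between two lng/lat points (illustrative helper)
def haversine_km(lng1, lat1, lng2, lat2):
    lng1, lat1, lng2, lat2 = map(np.radians, [lng1, lat1, lng2, lat2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2)**2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km: Earth's mean radius

merged_data['distance_km'] = haversine_km(
    merged_data['accept_gps_lng'], merged_data['accept_gps_lat'],
    merged_data['pickup_gps_lng'], merged_data['pickup_gps_lat']
)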

7. Identify and Filter Outliers

# Remove outliers based on task_duration (e.g., duration should be reasonable)
q1 = merged_data['task_duration'].quantile(0.25)
q3 = merged_data['task_duration'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

merged_data = merged_data[(merged_data['task_duration'] >= lower_bound) & (merged_data['task_duration'] <= upper_bound)]
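A related sanity check: negative task_duration values (a pickup logged before its acceptance) indicate data errors rather than statistical outliers, and are worth counting separately since a skewed distribution can leave the IQR lower bound below zero:

# Count rows where the pickup timestamp precedes the acceptance timestamp
print((merged_data['task_duration'] < 0).sum(), "rows with negative task_duration")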

8. Save the Cleaned Dataset

# Save the preprocessed dataset
merged_data.to_csv("cleaned_pickup_data.csv", index=False)
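One caveat worth noting: a CSV round-trip drops dtype information, so the datetime columns come back as plain strings when the file is re-read. Passing parse_dates on reload restores them; a sketch:

# Reload the cleaned file with the timestamp columns parsed as datetimes
cleaned = pd.read_csv(
    "cleaned_pickup_data.csv",
    parse_dates=['accept_time', 'pickup_time', 'time_window_start', 'time_window_end'],
)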

Summary of Steps

  1. Load and merge all city datasets into one.

  2. Handle missing values by replacing or removing them.

  3. Remove duplicates to ensure data integrity.

  4. Convert time columns to the datetime format.

  5. Calculate derived features, including durations and distances.

  6. Filter outliers to ensure quality.

  7. Save the cleaned dataset for further analysis.

Excel and BI Data Preprocessing

Here’s a detailed explanation of how to preprocess the Pickup Dataset using Excel and Power BI, broken down step by step:


Steps for Preprocessing in Excel

Step 1: Consolidate Files

  1. Open Excel.

  2. Create a Master Workbook.

  3. Import all five CSV files into separate sheets in this workbook.

    • Go to Data > Get Data > From Text/CSV.

    • Load each file into separate tabs.

  4. Add a new column City in each sheet (if not present), specifying the city name corresponding to the file.

  5. Use Copy-Paste or Power Query (from Data > Get Data > Combine Queries) to consolidate the data from all sheets into a single worksheet.


Step 2: Handle Missing Values

  1. Use Conditional Formatting (Home > Conditional Formatting > New Rule > Format only cells that contain > Blanks) to identify blank cells in critical columns (accept_time, pickup_time).

  2. Address missing values:

    • Numerical Columns (e.g., lng, lat): Fill with the column average using =AVERAGE(range).

    • Categorical Columns (e.g., aoi_type): Fill with "Unknown".

  3. Remove rows where critical columns like accept_time or pickup_time are missing.


Step 3: Remove Duplicate Records

  1. Highlight the package_id column.

  2. Go to Data > Remove Duplicates and select only the package_id column, so that exactly one record is retained per package (mirroring the drop_duplicates() step in the Python pipeline).


Step 4: Convert Time Columns

  1. Ensure all time columns (accept_time, pickup_time) are formatted correctly:

    • Select the column.

    • Go to Home > Number > More Number Formats > Custom and choose a date-time format like dd/mm/yyyy hh:mm:ss.

  2. Investigate rows with incorrect or incomplete time data manually.


Step 5: Calculate Derived Features

  1. Time Window Duration:

    • Add a new column Time Window Duration.

    • Use =time_window_end - time_window_start to calculate the duration (Excel stores the result as a fraction of a day; multiply by 24 for hours or apply a [h]:mm:ss format).

  2. Task Duration:

    • Add a new column Task Duration.

    • Use =pickup_time - accept_time (also a day fraction; multiply by 86400 to get seconds, matching the Python pipeline).

  3. Distance Between Acceptance and Pickup:

    • Add a new column Distance.

    • Use the formula: =SQRT((pickup_gps_lng - accept_gps_lng)^2 + (pickup_gps_lat - accept_gps_lat)^2)


Step 6: Identify and Filter Outliers

  1. Calculate Q1 (25th percentile) and Q3 (75th percentile) for Task Duration using the PERCENTILE.INC function, e.g. =PERCENTILE.INC(range, 0.25) and =PERCENTILE.INC(range, 0.75).

  2. Compute the IQR (Interquartile Range): $IQR = Q3 - Q1$

  3. Identify outliers:

    • Lower Bound = $Q1 - 1.5 \times IQR$

    • Upper Bound = $Q3 + 1.5 \times IQR$

  4. Filter out rows where Task Duration falls outside these bounds.


Step 7: Save the Cleaned Dataset

  1. Save the final preprocessed data as cleaned_pickup_data.xlsx.


Steps for Preprocessing in Power BI

Step 1: Import and Merge Data

  1. Open Power BI Desktop.

  2. Go to Home > Get Data > Text/CSV and load each city’s dataset.

  3. Use the Power Query Editor:

    • Go to Transform Data.

    • Use Append Queries to combine all datasets into one.

    • Add a City column to distinguish the origin.


Step 2: Handle Missing Values

  1. In Power Query, select Transform > Replace Values:

    • Replace null values in numerical columns with the column mean (read the mean via Transform > Statistics > Average, undo, then use Replace Values to substitute it for null).

    • Replace null values in categorical columns with "Unknown".

  2. Filter out rows with critical missing data in accept_time or pickup_time.


Step 3: Remove Duplicates

  1. Select the package_id column in Power Query.

  2. Use Remove Duplicates from the toolbar.


Step 4: Transform Time Columns

  1. Convert columns like accept_time and pickup_time to date-time format using Transform > Data Type > Date/Time.

  2. Create custom columns to calculate derived metrics (e.g., task duration, time window duration).


Step 5: Add Derived Features

  1. Time Window Duration:

    • Use Add Column > Custom Column with:

      [time_window_end] - [time_window_start]

  2. Task Duration:

    • Add a custom column:

      [pickup_time] - [accept_time]

  3. Distance Between Acceptance and Pickup:

    • Use:

      Number.Sqrt(Number.Power([pickup_gps_lng] - [accept_gps_lng], 2) + Number.Power([pickup_gps_lat] - [accept_gps_lat], 2))

Step 6: Identify Outliers

  1. Create measures for Q1 and Q3 using DAX's PERCENTILE.INC function, and compute the IQR as their difference.

  2. Use DAX filters to exclude rows outside the lower and upper bounds.


Step 7: Export Cleaned Dataset

  1. Once preprocessing is complete, export the dataset from Power BI:

    • Add the cleaned fields to a table visual, then choose More options (...) > Export data on the visual.

