Preprocessing Using Power BI and Excel
Last updated
Last updated
Steps to Follow:
Load Data into Power BI:
Open Power BI Desktop.
Use Get Data to import all five city datasets.
Append the datasets by selecting Home > Transform Data > Append Queries.
Add a City Column:
If the city column is missing, use the Add Column > Custom Column feature in Power Query to include the city name.
Remove Duplicates:
In Power Query, use Remove Duplicates for package_id
to ensure no duplicate rows.
Handle Missing Values:
Replace or remove null values using Transform > Replace Values or Remove Rows.
Create Delivery Duration:
Add a custom column in Power Query to calculate delivery duration:
Duration = [delivery_time] - [accept_time]
Compute Geographical Distance:
If distance calculations are needed, you can use DAX formulas after loading the data into the model, or preprocess using approximate calculations for now.
Filter Outliers:
Visualize delivery durations and distances in Power BI visuals.
Use filters to exclude unrealistic values.
Save the Processed Data:
Apply all changes in Power Query, and use Export Data from visuals if a cleaned file is needed.
Steps to Follow:
Merge Data:
Copy and paste each city dataset into a master workbook.
Add a column to each dataset specifying the city name.
Remove Duplicates:
Use Data > Remove Duplicates on package_id
.
Handle Missing Values:
Use filters or Conditional Formatting to identify and replace/remove null values.
Calculate Delivery Duration:
Create a new column:
Format the column as time.
Approximate Distance Calculation:
Add a formula for Euclidean distance (simplified for flat surfaces):
Detect Outliers:
Use Conditional Formatting or a pivot table to highlight extremely high/low delivery durations or distances.
Save the Processed File:
Save the cleaned file as a new workbook.
Once the preprocessing is completed in Power BI or Excel:
Validate the data quality by reviewing summary statistics.
Use this cleaned dataset for analysis and modeling.
=delivery_time - accept_time
=SQRT((delivery_gps_lng - accept_gps_lng)^2 + (delivery_gps_lat - accept_gps_lat)^2)