Approximate Distances and Outlier Detection
Calculate approximate distances and identify outliers in Python efficiently. Here's how you can approach each task:
1. Calculate Approximate Distances
To calculate distances between two sets of GPS coordinates (longitude and latitude), you can use the Haversine formula, which computes the great-circle distance between two points on a sphere.
Steps in Python:
Install Dependencies (if not already installed):
pip install geopy
Haversine Formula Implementation:
import math def haversine(lon1, lat1, lon2, lat2): # Convert latitude and longitude from degrees to radians lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2]) # Haversine formula dlon = lon2 - lon1 dlat = lat2 - lat1 a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2 c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a)) r = 6371 # Radius of the Earth in kilometers return r * c # Example: Compute distance between two points print(haversine(77.5946, 12.9716, 77.6784, 13.0827)) # Example coordinates
Using Geopy (Alternative):
from geopy.distance import geodesic coord1 = (12.9716, 77.5946) # (latitude, longitude) coord2 = (13.0827, 77.6784) print(geodesic(coord1, coord2).km)
Apply to Dataset: Use
pandas
to apply the function to your dataset:import pandas as pd # Sample DataFrame df = pd.DataFrame({ 'accept_lng': [77.5946, 77.6012], 'accept_lat': [12.9716, 12.9750], 'pickup_lng': [77.6784, 77.6400], 'pickup_lat': [13.0827, 13.0000] }) # Apply haversine function row-wise df['distance_km'] = df.apply( lambda row: haversine(row['accept_lng'], row['accept_lat'], row['pickup_lng'], row['pickup_lat']), axis=1 ) print(df)
2. Identify and Filter Out Outliers
You can use statistical techniques to identify outliers, such as the Interquartile Range (IQR) method.
Steps in Python:
Calculate IQR:
Q1 = df['distance_km'].quantile(0.25) Q3 = df['distance_km'].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR
Filter Out Outliers:
filtered_df = df[(df['distance_km'] >= lower_bound) & (df['distance_km'] <= upper_bound)] print(filtered_df)
Visualize Outliers (Optional): You can use boxplots for visualizing outliers:
import seaborn as sns import matplotlib.pyplot as plt sns.boxplot(df['distance_km']) plt.show()
Automated Detection Using Z-Scores (Alternative):
from scipy.stats import zscore df['z_score'] = zscore(df['distance_km']) filtered_df = df[df['z_score'].abs() < 3] # Remove data points where z-score > 3
Final Steps:
Save Preprocessed Data:
filtered_df.to_csv("cleaned_pickup_data.csv", index=False)
Verify Results: Ensure that the filtered dataset retains the majority of valid records and accurately represents the data distribution.
Calculate approximate distances and identify outliers using Power BI or Excel, here’s how you can handle these steps without relying on Python:
1. Calculating Approximate Distances in Power BI and Excel
Power BI:
Import Data: Load the dataset into Power BI.
Add a Custom Column:
Use DAX (Data Analysis Expressions) to calculate the distance using the Haversine formula.
Sample DAX formula:
Distance_KM = VAR R = 6371 VAR Lat1 = RADIANS('Table'[accept_gps_lat]) VAR Lat2 = RADIANS('Table'[pickup_gps_lat]) VAR DeltaLat = RADIANS('Table'[pickup_gps_lat] - 'Table'[accept_gps_lat]) VAR DeltaLng = RADIANS('Table'[pickup_gps_lng] - 'Table'[accept_gps_lng]) VAR A = SIN(DeltaLat / 2) * SIN(DeltaLat / 2) + COS(Lat1) * COS(Lat2) * SIN(DeltaLng / 2) * SIN(DeltaLng / 2) VAR C = 2 * ATAN2(SQRT(A), SQRT(1 - A)) RETURN R * C
This calculates the great-circle distance in kilometers.
Visualize: Create a histogram or scatterplot of
Distance_KM
for further insights.
Excel:
Add Haversine Formula: Use the following formula in Excel in a new column to calculate distances:
= 6371 * ACOS(COS(RADIANS(90 - [accept_gps_lat])) * COS(RADIANS(90 - [pickup_gps_lat])) + SIN(RADIANS(90 - [accept_gps_lat])) * SIN(RADIANS(90 - [pickup_gps_lat])) * COS(RADIANS([pickup_gps_lng] - [accept_gps_lng])))
Named Ranges: Replace
[column names]
with the appropriate column references for latitude and longitude.
2. Identifying and Filtering Out Outliers
Power BI:
Create a Boxplot:
Use Power BI’s visualization tools or custom visuals (e.g., a box-and-whisker chart) to identify outliers in
Distance_KM
.Note any values beyond the whiskers (1.5 × IQR from Q1 and Q3) as outliers.
Filter Data:
Add a DAX measure to compute IQR:
Q1 = PERCENTILE.INC('Table'[Distance_KM], 0.25) Q3 = PERCENTILE.INC('Table'[Distance_KM], 0.75) IQR = Q3 - Q1 LowerBound = Q1 - 1.5 * IQR UpperBound = Q3 + 1.5 * IQR
Apply filters to exclude records outside the bounds.
Excel:
Compute IQR:
Use Excel’s
PERCENTILE
function to calculate Q1 and Q3:Q1 = PERCENTILE.INC([Distance_KM], 0.25) Q3 = PERCENTILE.INC([Distance_KM], 0.75)
Calculate IQR:
IQR = Q3 - Q1 Lower Bound = Q1 - 1.5 * IQR Upper Bound = Q3 + 1.5 * IQR
Filter Rows:
Use Excel’s filter feature to remove rows where
Distance_KM
is outside the lower and upper bounds.
Last updated