Approximate Distances and Outlier Detection

Calculate approximate distances and identify outliers in Python efficiently. Here's how you can approach each task:

1. Calculate Approximate Distances

To calculate distances between two sets of GPS coordinates (longitude and latitude), you can use the Haversine formula, which computes the great-circle distance between two points on a sphere.

Steps in Python:

Install Dependencies (if not already installed):
```
pip install geopy
```

Haversine Formula Implementation:

import math

def haversine(lon1, lat1, lon2, lat2):
    # Convert latitude and longitude from degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    r = 6371  # Radius of the Earth in kilometers
    return r * c

# Example: Compute distance between two points
print(haversine(77.5946, 12.9716, 77.6784, 13.0827))  # Example coordinates

Using Geopy (Alternative):

from geopy.distance import geodesic

coord1 = (12.9716, 77.5946)  # (latitude, longitude)
coord2 = (13.0827, 77.6784)
print(geodesic(coord1, coord2).km)

Apply to Dataset: Use pandas to apply the function to your dataset:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'accept_lng': [77.5946, 77.6012],
    'accept_lat': [12.9716, 12.9750],
    'pickup_lng': [77.6784, 77.6400],
    'pickup_lat': [13.0827, 13.0000]
})

# Apply haversine function row-wise
df['distance_km'] = df.apply(
    lambda row: haversine(row['accept_lng'], row['accept_lat'], row['pickup_lng'], row['pickup_lat']),
    axis=1
)
print(df)

2. Identify and Filter Out Outliers

You can use statistical techniques to identify outliers, such as the Interquartile Range (IQR) method.

Steps in Python:

Calculate IQR:

Q1 = df['distance_km'].quantile(0.25)
Q3 = df['distance_km'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

Filter Out Outliers:

filtered_df = df[(df['distance_km'] >= lower_bound) & (df['distance_km'] <= upper_bound)]
print(filtered_df)

Visualize Outliers (Optional): You can use boxplots for visualizing outliers:

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(df['distance_km'])
plt.show()

Automated Detection Using Z-Scores (Alternative):

from scipy.stats import zscore

df['z_score'] = zscore(df['distance_km'])
filtered_df = df[df['z_score'].abs() < 3]  # Remove data points where z-score > 3

Final Steps:

Save Preprocessed Data:

filtered_df.to_csv("cleaned_pickup_data.csv", index=False)

Verify Results: Ensure that the filtered dataset retains the majority of valid records and accurately represents the data distribution.

Calculate approximate distances and identify outliers using Power BI or Excel, here’s how you can handle these steps without relying on Python:

1. Calculating Approximate Distances in Power BI and Excel

Power BI:

Import Data: Load the dataset into Power BI.

Add a Custom Column:

Use DAX (Data Analysis Expressions) to calculate the distance using the Haversine formula.

Sample DAX formula:

Distance_KM = 
VAR R = 6371 
VAR Lat1 = RADIANS('Table'[accept_gps_lat])
VAR Lat2 = RADIANS('Table'[pickup_gps_lat])
VAR DeltaLat = RADIANS('Table'[pickup_gps_lat] - 'Table'[accept_gps_lat])
VAR DeltaLng = RADIANS('Table'[pickup_gps_lng] - 'Table'[accept_gps_lng])
VAR A = SIN(DeltaLat / 2) * SIN(DeltaLat / 2) + 
        COS(Lat1) * COS(Lat2) * SIN(DeltaLng / 2) * SIN(DeltaLng / 2)
VAR C = 2 * ATAN2(SQRT(A), SQRT(1 - A))
RETURN R * C

This calculates the great-circle distance in kilometers.

Visualize: Create a histogram or scatterplot of Distance_KM for further insights.

Excel:

Add Haversine Formula: Use the following formula in Excel in a new column to calculate distances:

= 6371 * ACOS(COS(RADIANS(90 - [accept_gps_lat])) * COS(RADIANS(90 - [pickup_gps_lat])) + SIN(RADIANS(90 - [accept_gps_lat])) * SIN(RADIANS(90 - [pickup_gps_lat])) * COS(RADIANS([pickup_gps_lng] - [accept_gps_lng])))

Named Ranges: Replace [column names] with the appropriate column references for latitude and longitude.

2. Identifying and Filtering Out Outliers

Power BI:

Create a Boxplot:
- Use Power BI’s visualization tools or custom visuals (e.g., a box-and-whisker chart) to identify outliers in Distance_KM.
- Note any values beyond the whiskers (1.5 × IQR from Q1 and Q3) as outliers.

Filter Data:

Add a DAX measure to compute IQR:

Q1 = PERCENTILE.INC('Table'[Distance_KM], 0.25)
Q3 = PERCENTILE.INC('Table'[Distance_KM], 0.75)
IQR = Q3 - Q1
LowerBound = Q1 - 1.5 * IQR
UpperBound = Q3 + 1.5 * IQR

Apply filters to exclude records outside the bounds.

Excel:

Compute IQR:

Use Excel’s PERCENTILE function to calculate Q1 and Q3:

Q1 = PERCENTILE.INC([Distance_KM], 0.25)
Q3 = PERCENTILE.INC([Distance_KM], 0.75)

Calculate IQR:

IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

Filter Rows:
- Use Excel’s filter feature to remove rows where Distance_KM is outside the lower and upper bounds.

PreviousData Merging Guide NextDashboard-I

Last updated 8 months ago