Approximate Distances and Outlier Detection

Calculate approximate distances and identify outliers in Python efficiently. Here's how you can approach each task:


1. Calculate Approximate Distances

To calculate distances between two sets of GPS coordinates (longitude and latitude), you can use the Haversine formula, which computes the great-circle distance between two points on a sphere.

Steps in Python:

  1. Install Dependencies (if not already installed):

    pip install geopy
  2. Haversine Formula Implementation:

    import math
    
    def haversine(lon1, lat1, lon2, lat2):
        # Convert latitude and longitude from degrees to radians
        lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    
        # Haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
        r = 6371  # Radius of the Earth in kilometers
        return r * c
    
    # Example: Compute distance between two points
    print(haversine(77.5946, 12.9716, 77.6784, 13.0827))  # Example coordinates
  3. Using Geopy (Alternative):

    from geopy.distance import geodesic
    
    coord1 = (12.9716, 77.5946)  # (latitude, longitude)
    coord2 = (13.0827, 77.6784)
    print(geodesic(coord1, coord2).km)
  4. Apply to Dataset: Use pandas to apply the function to your dataset:

    import pandas as pd
    
    # Sample DataFrame
    df = pd.DataFrame({
        'accept_lng': [77.5946, 77.6012],
        'accept_lat': [12.9716, 12.9750],
        'pickup_lng': [77.6784, 77.6400],
        'pickup_lat': [13.0827, 13.0000]
    })
    
    # Apply haversine function row-wise
    df['distance_km'] = df.apply(
        lambda row: haversine(row['accept_lng'], row['accept_lat'], row['pickup_lng'], row['pickup_lat']),
        axis=1
    )
    print(df)

2. Identify and Filter Out Outliers

You can use statistical techniques to identify outliers, such as the Interquartile Range (IQR) method.

Steps in Python:

  1. Calculate IQR:

    Q1 = df['distance_km'].quantile(0.25)
    Q3 = df['distance_km'].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
  2. Filter Out Outliers:

    filtered_df = df[(df['distance_km'] >= lower_bound) & (df['distance_km'] <= upper_bound)]
    print(filtered_df)
  3. Visualize Outliers (Optional): You can use boxplots for visualizing outliers:

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    sns.boxplot(df['distance_km'])
    plt.show()
  4. Automated Detection Using Z-Scores (Alternative):

    from scipy.stats import zscore
    
    df['z_score'] = zscore(df['distance_km'])
    filtered_df = df[df['z_score'].abs() < 3]  # Remove data points where z-score > 3

Final Steps:

  • Save Preprocessed Data:

    filtered_df.to_csv("cleaned_pickup_data.csv", index=False)
  • Verify Results: Ensure that the filtered dataset retains the majority of valid records and accurately represents the data distribution.

Calculate approximate distances and identify outliers using Power BI or Excel, here’s how you can handle these steps without relying on Python:


1. Calculating Approximate Distances in Power BI and Excel

Power BI:

  1. Import Data: Load the dataset into Power BI.

  2. Add a Custom Column:

    • Use DAX (Data Analysis Expressions) to calculate the distance using the Haversine formula.

    • Sample DAX formula:

      Distance_KM = 
      VAR R = 6371 
      VAR Lat1 = RADIANS('Table'[accept_gps_lat])
      VAR Lat2 = RADIANS('Table'[pickup_gps_lat])
      VAR DeltaLat = RADIANS('Table'[pickup_gps_lat] - 'Table'[accept_gps_lat])
      VAR DeltaLng = RADIANS('Table'[pickup_gps_lng] - 'Table'[accept_gps_lng])
      VAR A = SIN(DeltaLat / 2) * SIN(DeltaLat / 2) + 
              COS(Lat1) * COS(Lat2) * SIN(DeltaLng / 2) * SIN(DeltaLng / 2)
      VAR C = 2 * ATAN2(SQRT(A), SQRT(1 - A))
      RETURN R * C
    • This calculates the great-circle distance in kilometers.

  3. Visualize: Create a histogram or scatterplot of Distance_KM for further insights.

Excel:

  1. Add Haversine Formula: Use the following formula in Excel in a new column to calculate distances:

    = 6371 * ACOS(COS(RADIANS(90 - [accept_gps_lat])) * COS(RADIANS(90 - [pickup_gps_lat])) + SIN(RADIANS(90 - [accept_gps_lat])) * SIN(RADIANS(90 - [pickup_gps_lat])) * COS(RADIANS([pickup_gps_lng] - [accept_gps_lng])))
  2. Named Ranges: Replace [column names] with the appropriate column references for latitude and longitude.


2. Identifying and Filtering Out Outliers

Power BI:

  1. Create a Boxplot:

    • Use Power BI’s visualization tools or custom visuals (e.g., a box-and-whisker chart) to identify outliers in Distance_KM.

    • Note any values beyond the whiskers (1.5 × IQR from Q1 and Q3) as outliers.

  2. Filter Data:

    • Add a DAX measure to compute IQR:

      Q1 = PERCENTILE.INC('Table'[Distance_KM], 0.25)
      Q3 = PERCENTILE.INC('Table'[Distance_KM], 0.75)
      IQR = Q3 - Q1
      LowerBound = Q1 - 1.5 * IQR
      UpperBound = Q3 + 1.5 * IQR
    • Apply filters to exclude records outside the bounds.

Excel:

  1. Compute IQR:

    • Use Excel’s PERCENTILE function to calculate Q1 and Q3:

      Q1 = PERCENTILE.INC([Distance_KM], 0.25)
      Q3 = PERCENTILE.INC([Distance_KM], 0.75)
    • Calculate IQR:

      IQR = Q3 - Q1
      Lower Bound = Q1 - 1.5 * IQR
      Upper Bound = Q3 + 1.5 * IQR
  2. Filter Rows:

    • Use Excel’s filter feature to remove rows where Distance_KM is outside the lower and upper bounds.

Last updated