Addressing Geospatial Outliers in Python: the Simple MAD-Based Approach
As with any other data, identifying and correcting outliers in geospatial data is a crucial step in data preparation. Outliers can significantly skew the results of spatial analyses and lead to incorrect conclusions. While there are other approaches, one straightforward and effective way to handle them is the Median Absolute Deviation (MAD) method. In this write-up, we’ll explore this simple yet powerful MAD-based approach to identifying and adjusting geospatial outliers in Python, making your data analysis more robust and reliable.
Getting Started with the Data
First, let’s generate a sample dataset to work with. This dataset simulates a collection of geographical coordinates (longitude and latitude) for a set of points. We have deliberately included a couple of outliers to illustrate how they can be detected and corrected.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# For demonstration, we will create a sample dataframe to include coordinates
np.random.seed(0) # For reproducible results
df = pd.DataFrame({
    '_longitude': np.concatenate([np.random.normal(loc=2.44, scale=0.01, size=1000), np.array([2.35, 2.36])]),
    '_latitude': np.concatenate([np.random.normal(loc=6.37, scale=0.01, size=1000), np.array([6.385, 6.39])])
})
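In the above code, we create a DataFrame ‘df’ with 1000 normally distributed points around a central location and intentionally append two extra points as outliers. Their longitudes sit eight to nine standard deviations from the center, while their latitudes are only 1.5 to 2 standard deviations off, so they stand out mainly in longitude.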
Identifying Outliers with the MAD Method
The MAD method is particularly effective at identifying outliers because it is built on medians. In our quest, MAD acts as a detector of these misplaced dots. Rather than simply eyeballing the map for oddities, MAD quantifies how far each point deviates from the group’s median, a robust center point that is much less influenced by outliers than the mean. Here’s how we can implement MAD-based outlier detection:
def mad_based_outlier(points, threshold=3):
    # The heart of our map: the median.
    median = np.median(points)
    # How far off the path each point is.
    deviation = np.abs(points - median)
    # The typical (median) deviation in our landscape.
    mad = np.median(deviation)
    # A score to identify those who wander too far.
    modified_z_score = 0.6745 * deviation / mad
    return modified_z_score > threshold
This step transforms our deviations into something called modified Z-scores. The constant 0.6745 is approximately the 75th percentile of the standard normal distribution; for normally distributed data the MAD is about 0.6745 standard deviations, so multiplying by it puts our scores on roughly the same scale as standard Z-scores. The Z-score tells us how many standard deviations a point is from the mean; similarly, the modified Z-score tells us roughly how many robust standard deviations (estimated from the MAD) a point is from the median. Points far from the median get a high score, and points whose score exceeds the specified threshold (3 by default here; 3.5 is another commonly cited choice) are flagged as outliers. It is still a good idea to visualize the result to confirm that the flagged points really are outliers, and to adjust the threshold based on what you observe.
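To make the arithmetic concrete, here is a minimal sketch on a tiny made-up array (the `toy` values are purely illustrative and not part of the dataset above). It repeats the function so the snippet runs on its own, and it also checks empirically that the MAD of a large standard normal sample lands close to 0.6745, which is where the constant comes from.

```python
import numpy as np

def mad_based_outlier(points, threshold=3):
    median = np.median(points)
    deviation = np.abs(points - median)
    mad = np.median(deviation)
    modified_z_score = 0.6745 * deviation / mad
    return modified_z_score > threshold

# Hypothetical toy values: four clustered points and one far away.
toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(toy)             # 3.0
deviation = np.abs(toy - median)    # [2, 1, 0, 1, 97]
mad = np.median(deviation)          # 1.0
scores = 0.6745 * deviation / mad   # [1.349, 0.6745, 0.0, 0.6745, 65.4265]
flags = mad_based_outlier(toy)      # only the last point exceeds 3

# Empirical check of the constant: for a standard normal sample
# (standard deviation 1), the MAD should be close to 0.6745.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
mad_of_normal = np.median(np.abs(sample - np.median(sample)))

print(scores)
print(flags)
print(mad_of_normal)
```

Only the value 100.0 is flagged: its score of about 65 is far above the threshold, while the clustered points all score below 1.5.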
Applying MAD Outlier Detection
Next, we apply the MAD-based outlier detection function to both the longitude and latitude columns of our DataFrame. This step assesses each longitude and latitude value to determine whether it deviates significantly from the median of its column. In both cases, the function returns a boolean Series (True for outliers, False for inliers) aligned with the DataFrame’s index, flagging each point based on that one dimension.
# Apply the mad_based_outlier function to check for outliers in both longitude and latitude
outliers_longitude = mad_based_outlier(df['_longitude'])
outliers_latitude = mad_based_outlier(df['_latitude'])
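One edge case is worth keeping in mind: if more than half of the values in a column are identical (for example, repeated readings from a stationary device), the MAD is zero and the division in the function produces infinite or undefined scores. The sketch below shows one possible guard. It is not part of the original function; the name `mad_based_outlier_safe` and the choice to flag nothing when the MAD is zero are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def mad_based_outlier_safe(points, threshold=3):
    """Hypothetical guarded variant: flags nothing when the MAD is zero."""
    points = pd.Series(points, dtype=float)
    median = points.median()
    deviation = (points - median).abs()
    mad = deviation.median()
    if mad == 0:
        # There is no spread around the median to scale by,
        # so report no outliers rather than dividing by zero.
        return pd.Series(False, index=points.index)
    return 0.6745 * deviation / mad > threshold

# Made-up inputs: a mostly constant column, and one with real spread.
constant = mad_based_outlier_safe([2.44, 2.44, 2.44, 2.44, 2.50])
spread = mad_based_outlier_safe([1.0, 2.0, 3.0, 4.0, 100.0])

print(constant.tolist())
print(spread.tolist())
```

Whether to flag nothing, flag everything that differs from the median, or fall back to another scale estimate depends on your data, so treat this as a starting point.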
After separately identifying outliers in longitude and latitude, we combine the two results to get a complete picture of the outliers in our dataset. The | operator applies a logical “OR” element-wise across the two boolean Series, so a point is considered an outlier if it is flagged in its longitude, its latitude, or both. This acknowledges that a geospatial outlier can be extreme in one dimension only and still affect the integrity of the overall dataset.
# Flag a point if either its longitude or its latitude is an outlier
outliers = outliers_longitude | outliers_latitude
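As a small illustration of the element-wise “OR”, here is a sketch with made-up flags for five points (the `lon_flags` and `lat_flags` arrays are hypothetical, not results from the dataset above). It also shows a quick way to count the flagged points and find their positions.

```python
import numpy as np

# Hypothetical per-dimension flags for five points
lon_flags = np.array([False, True, False, False, True])
lat_flags = np.array([False, False, True, False, True])

# A point is flagged if either dimension flags it
combined = lon_flags | lat_flags

n_flagged = int(combined.sum())        # True counts as 1
positions = np.flatnonzero(combined)   # indices of flagged points

print(combined)
print(n_flagged)
print(positions)
```

If you only wanted points that are extreme in both dimensions at once, the & operator would give the stricter logical “AND” instead.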
Visualizing Before and After Outlier Adjustment
As I mentioned earlier, before adjusting the outliers, it’s helpful to visualize them alongside the inliers:
# Plot geopoints before fixing outliers
plt.figure(figsize=(10, 6))
plt.scatter(df[~outliers]['_longitude'], df[~outliers]['_latitude'], c='blue', label='Inliers')
plt.scatter(df[outliers]['_longitude'], df[outliers]['_latitude'], c='red', label='Outliers')
plt.xlabel('_longitude')
plt.ylabel('_latitude')
plt.title('MAD Outlier Detection Before Fixing')
plt.legend()
plt.show()
Note: Because this is geospatial data, make sure the flagged points really are outliers before cleaning them up; a point far from the rest may still be a legitimate location. Also, if some likely inliers are being flagged and you only want to capture the extreme outliers, you can increase the threshold.
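To get a feel for how the threshold changes the result, the following sketch rebuilds the sample data and counts the flagged points at a few thresholds. The particular threshold values tried here are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

def mad_based_outlier(points, threshold=3):
    median = np.median(points)
    deviation = np.abs(points - median)
    mad = np.median(deviation)
    modified_z_score = 0.6745 * deviation / mad
    return modified_z_score > threshold

# Rebuild the sample data so this snippet runs on its own
np.random.seed(0)
df = pd.DataFrame({
    '_longitude': np.concatenate([np.random.normal(loc=2.44, scale=0.01, size=1000), np.array([2.35, 2.36])]),
    '_latitude': np.concatenate([np.random.normal(loc=6.37, scale=0.01, size=1000), np.array([6.385, 6.39])])
})

# Count how many points are flagged at each (arbitrarily chosen) threshold
counts = {}
for threshold in (3, 3.5, 4, 5):
    flagged = (mad_based_outlier(df['_longitude'], threshold)
               | mad_based_outlier(df['_latitude'], threshold))
    counts[threshold] = int(flagged.sum())

print(counts)
```

The counts can only stay the same or shrink as the threshold grows, so comparing them against the plot is a simple way to pick a value that keeps the extreme points and releases the borderline ones.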
After identifying the outliers, and once we are certain these are really outliers, we can remove them or replace them with the median values of the inliers to ensure they no longer skew our analysis. To replace them with the median values of inliers:
# replace outliers with median value of inliers
df.loc[outliers, '_longitude'] = df.loc[~outliers, '_longitude'].median()
df.loc[outliers, '_latitude'] = df.loc[~outliers, '_latitude'].median()
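If you would rather remove the outliers than replace them, a minimal sketch could look like the following. The small `df_demo` frame and its `outlier_flags` are made up so the snippet runs on its own; with the real data you would use `df` and `outliers` from above.

```python
import pandas as pd

# Hypothetical mini dataset with the last point flagged as an outlier
df_demo = pd.DataFrame({
    '_longitude': [2.44, 2.45, 2.43, 2.35],
    '_latitude': [6.37, 6.36, 6.38, 6.39],
})
outlier_flags = pd.Series([False, False, False, True])

# Keep only the inliers; the original frame is left untouched
df_removed = df_demo.loc[~outlier_flags].reset_index(drop=True)

print(len(df_demo), len(df_removed))
```

Removing rows shrinks the dataset, whereas replacing with the median keeps the row count but moves every flagged point to the same central location, so the better choice depends on what the downstream analysis needs.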
Finally, we visualize the dataset again, this time with the outliers fixed:
# Plot geopoints after fixing outliers
plt.figure(figsize=(10, 6))
plt.scatter(df['_longitude'], df['_latitude'], c='blue', label='Inliers')
plt.xlabel('_longitude')
plt.ylabel('_latitude')
plt.title('MAD Outlier Detection After Fixing')
plt.legend()
plt.show()
The MAD-based approach to identifying and adjusting geospatial outliers in Python is both simple and effective. Because it relies on medians rather than means, the detection itself is not distorted by the very outliers it is looking for, which makes it a useful first pass for cleaning geospatial data and getting more accurate and reliable analyses.
You can also find the code repository here on GitHub.