Environmental and Spatial Predictors of Bird Diversity across the US

Spatial Analysis of BBS Data with Python🐍

Spatial Data
Birds
Python
Author

Matěj Tvarůžka

Published

February 16, 2026

1 Introduction

One of the fundamental questions in biology is: What is the origin of biodiversity? Why are there so many distinct species at the same site? These questions can be examined through evolutionary and ecological perspectives, which are often connected. The ecological approach is crucial because it links the evolutionary history of coexisting species with the environments in which they occur and examines the factors that allow not just the build-up of local (and consequently regional) species richness but also its maintenance over time. One factor that facilitates species coexistence in birds and therefore drives their species richness (number of species at the site) on the regional scale is environmental primary productivity (Pigot, Tobias, and Jetz 2016).

At broad spatial scales, species richness typically increases with productivity (Hawkins, Porter, and Diniz-Filho 2003). In contrast, at finer spatial scales, the relationship between productivity and species richness is more complex: positive, negative, hump-shaped, and U-shaped relationships have all been documented (Mittelbach et al. 2001). There are several hypotheses for this relationship between productivity and species richness, but we still do not know the ultimate mechanism behind it (Di Cecco et al. 2022). In this assignment, I will examine how vegetation productivity drives species richness in birds across North America.

I will use the Breeding Bird Survey (BBS) dataset, which is one of the most significant and longest-running “citizen science” programmes in the world. Established in 1966, its primary goal was to monitor the status and trends of North American bird populations. This census operates at the landscape scale, with routes about 40 km long and 50 stops where volunteers count birds. As a measure of vegetation primary productivity and environmental heterogeneity, I decided to use the Normalised Difference Vegetation Index (NDVI), which is commonly used as a proxy for primary productivity in both regional (Nieto, Flombaum, and Garbulsky 2015) and local studies (Brüggeshemke and Fartmann 2025).

Note: Research Questions
  • How does vegetation density/heterogeneity predict bird diversity across the United States?
  • How do spatial predictors explain the patterns of species richness in North American birds?
Note: Tasks to cover
  1. Prepare and explore the data - clean BBS routes, extract NDVI for route geometry, and get elevation data with Earth Engine.
  2. Evaluate global and local spatial autocorrelation in species richness and model residuals.
  3. Build the OLS model and compare it with a GWR model.
View Code
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import rioxarray
import xvec
import folium
import ee
import geemap
import esda
import contextily
from libpysal import graph
import statsmodels.formula.api as smf
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW
from sklearn.metrics import mean_squared_error

2 Data preparation

2.1 Study area

I chose the lower 48 states of the USA as the study area because their route geometries are publicly available. In the basic BBS dataset, geometry is included only for the starting point of each route; since routes average about 40 km in length, a single point does not represent their geometry well. For basic spatial data preparation, we will use the geopandas library. First, we need to define the boundary of the study area.

View Code
us_states = gpd.read_file("study_area/cb_2018_us_state_500k.shp")
exclude = ["AK", "HI", "PR", "VI", "MP", "GU", "AS"] # States outside the area of interest
us_states = us_states.query("STUSPS not in @exclude") # Accessing the exclude variable outside the query() string
us_boundary = us_states.dissolve()

2.2 BBS data

Now it is time to load the species richness data, i.e., the number of species detected on each route. I decided to use data from 2018, which I had already prepared in R, computing species richness for each transect. The data have a single geometry column with the coordinates of each route's starting point. I downloaded the data from the official project site.

View Code
bbs = gpd.read_file("bbs/bbs_routes_SR_2018.gpkg", engine = "fiona")

bbs = bbs.loc[bbs["CountryNum"] == 840] # Select only US transects
bbs = bbs.loc[bbs["StateNum"] != 3] # Select every state except Alaska
bbs["RTENO"] = bbs["RTENO"].astype(str).str[3:] # Slice the string (skips the first 3 characters)
bbs.head()
RTENO RouteName StateNum Route CountryNum species_richness geometry
0 02001 ST FLORIAN 2 001 840 63 POINT (-87.60414 34.86869)
1 02004 TRADE 2 004 840 51 POINT (-87.05924 34.02979)
2 02007 SWAIM 2 007 840 63 POINT (-86.20304 34.86804)
3 02008 PINE GROVE 2 008 840 66 POINT (-85.71104 34.02781)
4 02010 MILLERVILLE 2 010 840 59 POINT (-85.92584 33.19135)

Let’s load the route data.

View Code
routes = gpd.read_file("routes/bbsrtsl020.shp")
routes.head()
FNODE_ TNODE_ LPOLY_ RPOLY_ LENGTH BBSRTSL020 ARC_LENGTH RTENO RTENAME RTELENG geometry
0 3 4 0 0 0.41414 1 39658.1 2072 ELKMONT 39658.1 LINESTRING (-87.34746 34.94893, -87.3461 34.94...
1 5 6 0 0 0.41892 2 39343.2 2001 ST FLORIAN 39343.2 LINESTRING (-87.99044 34.905, -87.9879 34.9049...
2 7 12 0 0 0.44543 3 44896.5 2204 STEVENSON 44896.5 LINESTRING (-85.77996 34.88146, -85.77946 34.8...
3 13 1 0 0 0.40238 4 41350.3 2073 WOODVILLE 41350.3 LINESTRING (-86.23356 34.72212, -86.23337 34.7...
4 10 9 0 0 0.37732 5 37331.3 2071 WHEELER DAM 37331.3 LINESTRING (-87.58241 34.74839, -87.58255 34.7...

Unfortunately, the route geometries are often inaccurate and sometimes consist of several fragments, which can cause problems. Therefore, we first need to remove potentially wrong route geometries. The expected route length is about 40 km, so let us filter out values that deviate substantially from it.

Important

This cleaning will result in the loss of several hundred data points, but we need precise route geometries for data extraction.

View Code
routes = routes.to_crs(epsg=5070) # First set the crs to USA Contiguous Albers Equal Area Conic projection
routes["calc_length_km"] = routes.geometry.length / 1000 # Calculating the route length for given geometry
routes = routes.loc[routes["RTELENG"] <= 45000] # Selecting routes under 45 km in length by the dataset measure
mask = (routes["calc_length_km"] >= 35) & (routes["calc_length_km"] <= 45) # Filter out fragments and over-long routes
routes_filtered = routes[mask].copy()

print(f"Original routes count: {len(routes)}")
print(f"Filtered routes count (35-45km): {len(routes_filtered)}")
print(f"Average route length: {routes_filtered['calc_length_km'].mean():.2f} km")
Original routes count: 3062
Filtered routes count (35-45km): 2570
Average route length: 40.14 km

Let us perform the spatial join of these two GeoDataFrames. To combine BBS data with route geometries, we use a spatial join with “inner” because we need matching values in both tables. I could also join the tables on the “RTENO” column, but I want to be sure the geometries match.

View Code
bbs = bbs.to_crs(routes_filtered.crs) # Make sure it's in the same coordinate system
bbs = bbs.drop(columns = "RTENO")

bbs["geometry"] = bbs.buffer(400) # Add a small buffer (400 m) to points to account for GPS inaccuracy

joined_data = gpd.sjoin(routes_filtered, bbs, how="inner", predicate="intersects") # Spatial join of bbs points data with the routes geometries

print(f"Routes in subset: {len(routes_filtered)}")
print(f"Routes with data: {len(joined_data['RTENO'].unique())}")
Routes in subset: 2570
Routes with data: 1611

The large drop in the number of routes is due to missing BBS data for the route geometries, which may reflect annual differences in the census - in some years, particular routes are not surveyed because of weather conditions, terrain obstacles or illness of the observer. However, the drop is still very large, so there are probably additional errors and inaccuracies in the geometry matching. Now we will do the final preparation of the table for the NDVI and elevation extraction. We need to drop redundant columns, create a buffer around the routes, add the route midpoint, and create Lat/Lon columns for the external API.

View Code
bbs_routes = joined_data[["RTENO", "species_richness", "geometry"]].copy() # Add .copy() here to break the link to joined_data

bbs_routes["buff_3000"] = bbs_routes.buffer(3000)  # Creating 3 km buffer - buffer choice explained in NDVI extraction part
bbs_routes["route_midpoint"] = bbs_routes.geometry.interpolate(0.5, normalized=True) # Calculate the midpoint (50% along the line) as a reference geometry
bbs_routes = bbs_routes.drop(columns = "geometry") # We no longer need routes geometry

temp_gdf = bbs_routes.set_geometry("route_midpoint").to_crs(4326)
bbs_routes["lon"] = temp_gdf.geometry.x
bbs_routes["lat"] = temp_gdf.geometry.y
bbs_routes = bbs_routes.set_geometry("buff_3000")
bbs_routes.head()
RTENO species_richness buff_3000 route_midpoint lon lat
0 2072 63 POLYGON ((782767.664 1352804.909, 782613.87 13... POINT (799058.576 1358112.615) -87.165585 34.947661
2 2204 54 POLYGON ((911518.247 1350136.581, 911484.644 1... POINT (920348.359 1362929.663) -85.828469 34.882272
3 2073 65 POLYGON ((883221.105 1343004.45, 883424.384 13... POINT (896739.418 1353510.697) -86.098287 34.820791
4 2071 61 POLYGON ((766029.847 1327436.538, 765484.087 1... POINT (776778.876 1331128.227) -87.437364 34.725655
5 2007 63 POLYGON ((869910.569 1340151.781, 869882.814 1... POINT (879587.76 1341112.337) -86.300271 34.726327

2.3 NDVI raster

The Normalised difference vegetation index (NDVI) indicates vegetation health within raster image pixels by quantifying the amount of near-infrared (NIR) light plants reflect. Healthy vegetation reflects a higher proportion of NIR and a lower proportion of red light than stressed vegetation or non-living surfaces with similar visible colours (for example, turf fields). This contrast makes NDVI a practical tool for evaluating vegetation health in raster images relative to their surroundings.

NDVI is calculated for each pixel with the following calculation:

\(\Large NDVI = \frac{(NIR - Red)}{(NIR + Red)}\)

This formula generates a value between -1 and +1. Low reflectance in the red channel and high reflectance in the NIR channel will yield a high NDVI value (healthy vegetation), while the inverse will result in a low NDVI value (unhealthy vegetation). Negative values typically represent non-vegetation, such as water or rock.
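To make the formula concrete, here is a minimal NumPy sketch on synthetic reflectance values (the numbers are illustrative, not from the BBS data):

```python
import numpy as np

# Synthetic reflectance values (fractions of incident light)
nir = np.array([0.50, 0.30, 0.10])   # healthy vegetation, sparse vegetation, water
red = np.array([0.08, 0.15, 0.20])

ndvi = (nir - red) / (nir + red)     # element-wise NDVI, bounded in [-1, 1]
print(ndvi.round(3))                 # → [ 0.724  0.333 -0.333]
```

As expected, healthy vegetation scores high, sparse vegetation intermediate, and water negative.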

The NDVI raster for the whole area would be very large and computationally demanding. Therefore, I decided to choose a different approach and work with a compressed 8-bit raster (values from 0 to 255) from NASA Earth Observations (NEO), which represents a 1-month average for June. The values in this file have been scaled and resampled for visualisation in NEO and are not suitable for rigorous scientific analysis. However, they may be used for basic assessments and for identifying general trends, which is exactly the goal of this study. We only need to resolve the differences in value scaling. We will extract the mean NDVI values for a 3 km buffer; these distances have been shown to preserve higher predictive power for covariates in studies of breeding birds (Byer et al. 2025). For manipulating raster data, we will use the libraries rioxarray (for reading raster files), xvec (for data extraction and zonal statistics computation), and folium for map visualisations.

View Code
ndvi = rioxarray.open_rasterio("MOD_NDVI_M_2018-06-01_rgb_3600x1800.TIFF")

ndvi_us = ndvi.rio.clip(us_boundary.to_crs(ndvi.rio.crs).geometry) # Matching CRS and clipping to study area
ndvi_us = ndvi_us.drop_vars("band").squeeze() # Getting rid of redundant dimension
ndvi_us = ndvi_us.where(ndvi_us>0) # Filtering null values

print(f"Raster CRS: {ndvi_us.rio.crs}") # Check the CRS to see if it's in degrees (4326) or meters
Raster CRS: EPSG:4326
View Code
_ = ndvi_us.plot()

We can see from the map that NDVI is much higher on the east coast than in the western US. Now we can extract values from the raster for the route buffers we created earlier. We will extract:
  1. Mean NDVI for each buffer
  2. Standard deviation of NDVI for each buffer - this will help estimate the heterogeneity of the vegetation.
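Under the hood, a zonal statistic is just a masked aggregation: collect the raster cells a polygon covers and summarise them. A toy NumPy sketch of the idea (synthetic raster and mask, purely illustrative of what xvec automates for real geometries):

```python
import numpy as np

# Tiny synthetic 8-bit NDVI raster
raster = np.array([[200., 210.,  90.],
                   [220., 240.,  80.],
                   [230., 250.,  70.]])

# Boolean mask standing in for one buffer polygon (the cells it covers)
zone = np.array([[True,  True,  False],
                 [True,  True,  False],
                 [False, False, False]])

values = raster[zone]                 # cells inside the zone
print(values.mean(), values.std())   # zonal mean and heterogeneity (std)
```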

View Code
ndvi_us = ndvi_us.rio.reproject(bbs_routes.crs) # Reprojecting to EPSG: 5070
ndvi_extraction = bbs_routes.to_crs(ndvi_us.rio.crs)

zonal_stats = ndvi_us.drop_vars("spatial_ref").xvec.zonal_stats( # Extracting values for NDVI from buffers
    geometry=ndvi_extraction.buff_3000,
    x_coords="x",
    y_coords="y",
    stats=["mean", "std"],
)
zonal_stats.head()
<xarray.DataArray (geometry: 5, zonal_statistics: 2)> Size: 80B
array([[231.66667175,   7.58653784],
       [243.        ,   7.8740077 ],
       [252.5       ,   2.59807611],
       [201.5       ,   2.5       ],
       [235.25      ,  20.72890472]])
Coordinates:
  * geometry          (geometry) geometry 40B POLYGON ((782767.6638634393 135...
    index             (geometry) int64 40B 0 2 3 4 5
  * zonal_statistics  (zonal_statistics) <U4 32B 'mean' 'std'
Indexes:
    geometry  GeometryIndex (crs=EPSG:5070)
Attributes:
    TIFFTAG_XRESOLUTION:     1
    TIFFTAG_YRESOLUTION:     1
    TIFFTAG_RESOLUTIONUNIT:  1 (unitless)
    AREA_OR_POINT:           Area
    scale_factor:            1.0
    add_offset:              0.0
    _FillValue:              0.0

Let us check whether there are any missing values in the data.

View Code
ndvi = zonal_stats.xvec.to_geodataframe(name="NDVI") # Transforming zonal statistics into GeoDataFrame
print(ndvi.isna().sum()) # Returns the count of NaNs for every column in the dataframe
geometry     0
index        0
NDVI        66
dtype: int64
View Code
ndvi_clean = ndvi.dropna(subset=["NDVI"]) # Dropping rows where the NDVI value is missing

ndvi_stats = ndvi_clean.reset_index().pivot(
    index="index", 
    columns="zonal_statistics", 
    values="NDVI"
) # This creates a table with index as rows and mean/std as columns

ndvi_stats.columns = ["ndvi_" + col for col in ndvi_stats.columns] # Add the "ndvi_" prefix to the new columns
geometries = ndvi_clean[["index", "geometry"]].drop_duplicates("index").set_index("index") # Get the unique geometries for each index
ndvi_data = geometries.join(ndvi_stats).reset_index() # Join them together
ndvi_data.head()
index geometry ndvi_mean ndvi_std
0 0 POLYGON ((782767.664 1352804.909, 782613.87 13... 231.666672 7.586538
1 2 POLYGON ((911518.247 1350136.581, 911484.644 1... 243.000000 7.874008
2 3 POLYGON ((883221.105 1343004.45, 883424.384 13... 252.500000 2.598076
3 4 POLYGON ((766029.847 1327436.538, 765484.087 1... 201.500000 2.500000
4 5 POLYGON ((869910.569 1340151.781, 869882.814 1... 235.250000 20.728905

Although the NDVI data were retrieved in 8-bit format (0–255), we can standardise them to Z-scores prior to modelling. This centres the data and removes the original units, so the model coefficients do not depend on the satellite imagery’s original bit-depth or scaling.

View Code
# Standardise the Mean (Relative Productivity)
ndvi_data["ndvi_mean_z"] = (ndvi_data["ndvi_mean"] - ndvi_data["ndvi_mean"].mean()) / ndvi_data["ndvi_mean"].std()
# Standardise the Std (Relative Heterogeneity)
ndvi_data["ndvi_heterog_z"] = (ndvi_data["ndvi_std"] - ndvi_data["ndvi_std"].mean()) / ndvi_data["ndvi_std"].std()
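A useful sanity check: Z-scores are invariant under any linear rescaling, so standardising the raw 8-bit values yields exactly the same predictor as standardising properly rescaled NDVI would (assuming NEO's scaling is linear). A quick demonstration with illustrative numbers:

```python
import numpy as np

raw = np.array([120., 180., 231., 252., 201.])   # 8-bit NDVI values
scaled = raw / 255 * 2 - 1                       # hypothetical linear rescale to [-1, 1]

def zscore(x):
    # Sample standard deviation (ddof=1), matching pandas' default .std()
    return (x - x.mean()) / x.std(ddof=1)

# The linear transform cancels out in the z-score
print(np.allclose(zscore(raw), zscore(scaled)))  # → True
```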

Now that we have standardised our data, we can perform the spatial join with the bbs_routes GeoDataFrame.

View Code
ndvi_to_join = ndvi_data[["ndvi_mean_z", "ndvi_heterog_z", "geometry"]]

buffer = bbs_routes.set_geometry("buff_3000")

bbs_ndvi = gpd.sjoin(
    buffer, 
    ndvi_to_join, 
    how="inner", # Keeps only rows that exist in both sets
    predicate="intersects" # Ensures the polygons must overlap/be the same
)

bbs_ndvi = bbs_ndvi.drop(columns="index_right") # Dropping index_right
data = bbs_ndvi.drop_duplicates(subset=["RTENO"]) # Drop duplicate routes produced by the join
data.head()
RTENO species_richness buff_3000 route_midpoint lon lat ndvi_mean_z ndvi_heterog_z
0 2072 63 POLYGON ((782767.664 1352804.909, 782613.87 13... POINT (799058.576 1358112.615) -87.165585 34.947661 0.782085 -0.136629
2 2204 54 POLYGON ((911518.247 1350136.581, 911484.644 1... POINT (920348.359 1362929.663) -85.828469 34.882272 0.997218 -0.101972
3 2073 65 POLYGON ((883221.105 1343004.45, 883424.384 13... POINT (896739.418 1353510.697) -86.098287 34.820791 0.850105 1.447823
4 2071 61 POLYGON ((766029.847 1327436.538, 765484.087 1... POINT (776778.876 1331128.227) -87.437364 34.725655 0.209451 -0.749866
5 2007 63 POLYGON ((869910.569 1340151.781, 869882.814 1... POINT (879587.76 1341112.337) -86.300271 34.726327 0.850105 1.447823

Now we can take a look at the map of relative NDVI and see whether it corresponds with our expectations.

View Code
m = data.explore(
    column="ndvi_mean_z", 
    cmap="RdYlGn",
    marker_kwds={'radius':5},
    vmin=-2.5, 
    vmax=2.5, 
    legend=True,
    tooltip=["ndvi_mean_z"],
    tiles="CartoDB positron",
    popup=True # Click to see all data for that route
)
m
Important

A visible challenge for our spatial model is the high density and overlap of the routes in the Northeast. This concentration introduces sampling bias (over-representing certain environments) and pseudoreplication, where overlapping buffers sample nearly identical pixels, violating the assumption of independent observations.

To mitigate this, I implemented a spatial thinning procedure using the function thin_spatial_geopandas. This algorithm utilises a spatial index (sindex) to efficiently identify route buffers within a 1 m threshold and randomly selects one representative route from overlapping pairs for retention, while discarding the rest. This ensures a more geographically balanced dataset for subsequent analysis.

View Code
def thin_spatial_geopandas(data, distance_threshold=1):

    # Ensure we are working with a copy and shuffle for randomness
    gdf_temp = data.copy().sample(frac=1, random_state=42).reset_index(drop=True)
    
    keep_indices = []
    discard_indices = []
    processed = np.zeros(len(gdf_temp), dtype=bool)

    sindex = gdf_temp.sindex   # Build a spatial index

    for i in range(len(gdf_temp)):
        if processed[i]:
            continue
        
        keep_indices.append(i)     # Keep this route
        processed[i] = True
        
        current_geom = gdf_temp.geometry.iloc[i]
        
        # Find all neighbours within the distance threshold
        # query returns indices of geometries whose bounding box intersects the search area, then filter by actual distance
        potential_neighbors = sindex.query(current_geom.buffer(distance_threshold))
        
        for neighbor_idx in potential_neighbors:
            if not processed[neighbor_idx] and neighbor_idx != i:
                if current_geom.distance(gdf_temp.geometry.iloc[neighbor_idx]) < distance_threshold:
                    discard_indices.append(neighbor_idx)
                    processed[neighbor_idx] = True
                    
    return gdf_temp.iloc[keep_indices], gdf_temp.iloc[discard_indices]

df_thinned, df_removed = thin_spatial_geopandas(data, distance_threshold=1)

print(f"Routes kept: {len(df_thinned)}")
print(f"Routes removed: {len(df_removed)}")
Routes kept: 1442
Routes removed: 150

We can check the result on an interactive map; by toggling the layers we can see exactly which routes were removed and which were kept.

View Code
df_thinned["status"] = "Kept"
df_removed["status"] = "Removed"

m = df_thinned.explore(color="#2ca02c", name="Kept Routes")
df_removed.explore(m=m, color="#d62728", name="Removed Routes")

folium.LayerControl(collapsed=False).add_to(m)

m
View Code
bbs_ndvi = df_thinned.drop(columns = "buff_3000") # We no longer need buffer geometry
bbs_ndvi = bbs_ndvi.set_geometry("route_midpoint")
bbs_ndvi.head()
RTENO species_richness route_midpoint lon lat ndvi_mean_z ndvi_heterog_z status
0 60130 26 POINT (-687991.066 1056389.17) -103.354305 32.340717 -2.006735 -0.573757 Kept
1 72016 65 POINT (1411937.291 2204439.401) -78.818844 41.692045 1.028855 -0.141942 Kept
2 91036 55 POINT (446389.92 2420135.718) -90.357021 44.651087 0.146810 0.078148 Kept
3 34068 47 POINT (553981.276 2070695.428) -89.311522 41.466289 1.044674 -0.629305 Kept
4 91023 56 POINT (493986.861 2548870.791) -89.648401 45.779301 0.873833 -0.629305 Kept

2.4 Elevation data

To get the elevation data, I integrated topographic data from the NASA Digital Elevation Model (NASADEM). I used the Google Earth Engine API and the ee library to perform cloud-based spatial join operations.

The extraction process involved several technical steps:

  • Feature Transformation: Local coordinate pairs (Latitude/Longitude) for each survey route midpoint were converted into ee.Feature objects to make them compatible with the GEE spatial engine.
  • High-Resolution Sampling: I accessed the NASADEM Global 30m dataset. Using a 30-metre sampling scale, the extraction matched the native resolution of the satellite sensor, ensuring that the elevation value assigned to each route accurately reflects the local terrain.
  • Server-Side Join: The .sampleRegions() function performs the spatial join on Google’s servers, meaning we only download the final elevation values rather than the entire global raster.
  • Data Integration: The extracted values were pulled from the cloud using the geemap library and merged back into the data_final DataFrame using the unique route identifier RTENO.
View Code
#  Initialise Earth Engine Project ID
try:
    ee.Initialize(project="my-project-ee-470714")
except Exception as e:
    ee.Authenticate()
    ee.Initialize(project="my-project-ee-470714")
    
# Convert DataFrame rows into GEE Features
features = []
for index, row in bbs_ndvi.iterrows():
    # GEE uses [Longitude, Latitude]
    point = ee.Geometry.Point([row["lon"], row["lat"]])
    feat = ee.Feature(point, {"RTENO": row["RTENO"]})
    features.append(feat)

points_ee = ee.FeatureCollection(features)

# Load the NASADEM Global 30m dataset
nasadem = ee.Image("NASA/NASADEM_HGT/001").select("elevation")

# Sample the image at route midpoints
sampled_points = nasadem.sampleRegions(
    collection=points_ee,
    properties=["RTENO"],
    scale=30
)

info = sampled_points.getInfo()
features = info['features']
dict_list = [f['properties'] for f in features]
df_elev = pd.DataFrame(dict_list)
data_final = bbs_ndvi.merge(df_elev[["RTENO", "elevation"]], on="RTENO", how="left")

print("Elevation successfully extracted from GEE!")
Elevation successfully extracted from GEE!
Tip: Prerequisites - Google Earth Engine Access

To execute the code block above with your own device, you must have a registered Google Earth Engine account.

  1. Registration: Sign up at earthengine.google.com.
  2. Cloud Project: You must create a project in the Google Cloud Console.
  3. Project ID: Replace "my-project-ee-470714" in the code with your specific Google Cloud Project ID. Without a valid ID linked to an enabled Earth Engine API, the initialisation will fail.

Now we can look at the elevation histogram using the seaborn library.

View Code
sns.displot(data_final["elevation"])

View Code
data_final.to_file("assignment_data.gpkg", layer="bbs_routes", driver="GPKG") # Save to a GeoPackage
data = gpd.read_file("assignment_data.gpkg")
data.head()
RTENO species_richness lon lat ndvi_mean_z ndvi_heterog_z status elevation geometry
0 60130 26 -103.354305 32.340717 -2.006735 -0.573757 Kept 1073 POINT (-687991.066 1056389.17)
1 72016 65 -78.818844 41.692045 1.028855 -0.141942 Kept 508 POINT (1411937.291 2204439.401)
2 91036 55 -90.357021 44.651087 0.146810 0.078148 Kept 369 POINT (446389.92 2420135.718)
3 34068 47 -89.311522 41.466289 1.044674 -0.629305 Kept 221 POINT (553981.276 2070695.428)
4 91023 56 -89.648401 45.779301 0.873833 -0.629305 Kept 502 POINT (493986.861 2548870.791)

3 Spatial autocorrelation

With the data prepared, we first evaluate the extent of spatial dependency in bird species richness using a three-pronged diagnostic approach:

  • Moran plot - to see the global degree of clustering
  • Correlogram - to see how the spatial autocorrelation changes with larger neighbourhoods
  • LISA (Local Indicators of Spatial Association) - to see the distribution of significantly spatially autocorrelated values.

For this, we will use the libraries PySAL (for spatial weights), esda (for Moran plot and LISA), and contextily for basemaps.

First, we need to build spatial weights with the graph function using K-Nearest Neighbours. Then we will calculate the spatial lag, Moran’s I, and a correlogram.

View Code
knn5 = graph.Graph.build_knn(data, k=5)
knn5 = knn5.transform("R") # Standardising rows

# Standardise richness (Z-score)
data["richness_std"] = (data["species_richness"] - data["species_richness"].mean()) / data["species_richness"].std()
# Calculate spatial lag using the KNN5 weights you already built
data["richness_lag"] = knn5.lag(data["richness_std"])

# Calculate Moran's I
moran = esda.moran.Moran(data['species_richness'], knn5)

# Calculate correlogram
k = [5, 10, 25, 50, 75, 100, 500, 1000]
correlogram = esda.correlogram(
  geometry=data.representative_point(),
  variable=data["species_richness"],
  support=k,
  distance_type="knn",
)

We can plot the results and see if there is some spatial autocorrelation in the data.

View Code
f, ax = plt.subplots(1, 2, figsize=(16, 7))

# --- PANEL 1: MORAN SCATTERPLOT ---
sns.regplot(
    x="richness_std",
    y="richness_lag",
    data=data,
    marker=".",
    scatter_kws={"alpha": 0.3, "color": "steelblue"},
    line_kws=dict(color="lightcoral"),
    ax=ax[0]
)
ax[0].set_aspect("equal")
ax[0].axvline(0, c="black", alpha=0.5, linewidth=1)
ax[0].axhline(0, c="black", alpha=0.5, linewidth=1)
ax[0].set_title(f"A. Moran Plot (Global $I$: {moran.I:.3f})", fontsize=14, loc='left')
ax[0].set_xlabel("Standardised Species Richness")
ax[0].set_ylabel("Spatial Lag")

# Add Quadrant Labels
for label, x_pos, y_pos in [("HH", 0.9, 0.9), ("HL", 0.9, 0.1), ("LH", 0.1, 0.9), ("LL", 0.1, 0.1)]:
    ax[0].text(x_pos, y_pos, label, transform=ax[0].transAxes, fontweight="bold", fontsize=12)

# --- PANEL 2: SPATIAL CORRELOGRAM ---
ax[1].plot(k, correlogram.I, marker="o", linestyle="-", color="steelblue", markersize=4)
ax[1].axhline(0, color="black", linestyle="--", alpha=0.5)
ax[1].set_title("B. Spatial Scale Decay (Correlogram)", fontsize=14, loc='left')
ax[1].set_xlabel("Number of Nearest Neighbors (K)")
ax[1].set_ylabel("Moran's I")
ax[1].grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

The Global Moran’s I value confirms a strong, positive spatial autocorrelation (I≈0.497), indicating that bird species richness is not randomly distributed across the United States but is significantly geographically clustered. The spatial correlogram demonstrates that spatial influence is strongest at small scales but drops significantly when the neighbourhood size reaches approximately 100 neighbours. This inflexion point suggests that the ecological processes driving bird richness operate within a specific regional range.
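As a cross-check on the esda result, Moran's I can be computed directly from its definition, \(I = \frac{n}{S_0}\frac{z^{T}Wz}{z^{T}z}\), with a row-standardised weights matrix. A self-contained sketch on a tiny illustrative dataset (not the BBS routes):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and a spatial weights matrix W."""
    z = y - y.mean()                      # deviations from the mean
    s0 = W.sum()                          # sum of all weights
    return (len(y) / s0) * (z @ W @ z) / (z @ z)

# Four sites on a line, each site neighbouring the adjacent one(s)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)      # row-standardise, as knn5.transform("R") does

clustered = np.array([10., 12., 30., 33.])   # similar values sit next to each other
print(round(morans_i(clustered, W), 3))      # → 0.537 (positive autocorrelation)
```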

View Code
lisa = esda.moran.Moran_Local(data["species_richness"], knn5.to_W())

f, ax = plt.subplots(figsize=(12, 10))
lisa.plot(data, crit_value=0.05, ax=ax, alpha=0.6, markersize=20) 
ax.set_title("LISA Cluster Map (p < 0.05)", fontsize=14)
contextily.add_basemap(ax, crs=data.crs.to_string(), source=contextily.providers.CartoDB.PositronNoLabels)

plt.show()

The LISA map reveals a distinct East-West dichotomy:

  • Eastern US (High-High): Dominated by biodiversity hotspots, likely driven by higher annual precipitation and structural vegetation density (NDVI).
  • Arid West (Low-Low): Exhibits extensive coldspots, reflecting the environmental constraints of higher elevation and lower primary productivity.

4 Linear regression

4.1 OLS model

Now we can continue with linear regression. We already know there is high spatial autocorrelation; therefore, our ultimate goal is to use geographically weighted linear regression (GWR). First, we will build a simple Ordinary Least Squares (OLS) model and identify the residuals. However, this model will already consist of two spatial predictors - longitude and latitude.

We will use statsmodel (for the OLS model) and the mgwr package.

View Code
# Define our dependent variable (target) and independent variables (predictors)
dependent = "species_richness"
independents = ["ndvi_mean_z", "ndvi_heterog_z", "elevation", "lon", "lat"]

formula = f"{dependent} ~ {' + '.join(independents)}"
print(f"OLS Formula: {formula}")
OLS Formula: species_richness ~ ndvi_mean_z + ndvi_heterog_z + elevation + lon + lat
View Code
ols = smf.ols(formula, data=data).fit()

ols.summary()
OLS Regression Results
Dep. Variable: species_richness R-squared: 0.329
Model: OLS Adj. R-squared: 0.327
Method: Least Squares F-statistic: 141.1
Date: Sun, 15 Feb 2026 Prob (F-statistic): 6.76e-122
Time: 17:10:45 Log-Likelihood: -5542.3
No. Observations: 1442 AIC: 1.110e+04
Df Residuals: 1436 BIC: 1.113e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 51.1524 3.096 16.521 0.000 45.079 57.226
ndvi_mean_z 5.8241 0.449 12.976 0.000 4.944 6.705
ndvi_heterog_z 1.3671 0.313 4.368 0.000 0.753 1.981
elevation -0.0015 0.001 -2.415 0.016 -0.003 -0.000
lon 0.1071 0.029 3.693 0.000 0.050 0.164
lat 0.3419 0.066 5.201 0.000 0.213 0.471
Omnibus: 10.586 Durbin-Watson: 1.967
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.957
Skew: 0.141 Prob(JB): 0.00253
Kurtosis: 3.346 Cond. No. 9.16e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Note

The Global OLS model indicates that relative primary productivity ndvi_mean_z is the dominant driver of bird species richness across the US, and the model explains about one third of the variance (R² = 0.329). Bird diversity decreases slightly with increasing elevation and increases with vegetation heterogeneity. Longitude and latitude, while significant, have small coefficients. Furthermore, the OLS model assumes these relationships are constant across the continent. To investigate whether these drivers vary geographically - for instance, whether NDVI matters more in the arid West than in the humid East - I will now proceed to Geographically Weighted Regression (GWR). But first, let us look at the OLS model from a spatial perspective.

We can compare the model prediction with the actual species richness.

View Code
predicted = ols.predict(data)

With the predictions calculated, we can now compare them with the actual data.

View Code
f, axs = plt.subplots(1, 2, figsize=(24, 7))
data.plot(
    predicted, legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[0]
)
data.plot(
    "species_richness", legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[1]
)
axs[0].set_title("OLS prediction", fontsize=20)
axs[1].set_title("Actual Species Richness", fontsize=20)

axs[0].set_axis_off()
axs[1].set_axis_off()

In the map on the left, we can see that the model predicts the expected difference in species richness between the East and the West of the US. However, it both overestimates and underestimates richness in some areas, because the actual data (map on the right) show a more varied pattern.

View Code
data["residual"] = ols.resid
max_residual = ols.resid.abs().max()
lisa = esda.Moran_Local(data['residual'], knn5.to_W())
lisa.explore(data, prefer_canvas=True,  marker_kwds={'radius':4}, tiles="CartoDB Positron")

The map presents the residuals: locations where the model fails to explain our dependent variable. You can spot red areas, which have higher species richness than expected by the model, while blue areas show the opposite.
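To see why clustered residuals are a problem, it helps to recall what Moran's I actually measures. Below is a minimal numpy re-implementation of global Moran's I with row-standardised k-nearest-neighbour weights, run on synthetic data; it is illustrative only (not the esda code), and the sample size and k are arbitrary choices.

```python
import numpy as np

def morans_i(values, coords, k=5):
    """Global Moran's I with row-standardised k-nearest-neighbour weights.
    A minimal sketch of what esda.Moran would report for the residuals."""
    n = len(values)
    z = values - values.mean()
    # pairwise Euclidean distances; exclude self from the neighbour search
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d[i])[:k]] = 1.0 / k  # each row sums to one
    # I = (n / S0) * (z' W z) / (z' z); S0 = n under row-standardisation
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
trend = coords[:, 0] + rng.normal(0, 0.3, 200)   # spatially clustered values
noise = rng.normal(0, 1, 200)                    # no spatial structure

print(f"clustered: {morans_i(trend, coords):.2f}")   # strongly positive
print(f"random:    {morans_i(noise, coords):.2f}")   # near zero
```

A value near 1 (as for the synthetic west-to-east trend) means neighbouring locations share similar residuals, exactly the situation the LISA map above flags for the OLS fit.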

4.2 Geographically Weighted Regression

Now we can finally progress to the GWR model, which overcomes the limitations of OLS by fitting a separate regression at many locations across the study area, thereby accounting for the spatial autocorrelation in the residuals.
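At its core, GWR repeats an ordinary weighted least-squares fit at each calibration location, with observation weights decaying with distance from that location. A minimal sketch of the local estimator (not the mgwr code; the data here are synthetic):

```python
import numpy as np

def local_beta(X, y, w):
    """GWR's local estimator at one calibration point i:
    beta_i = (X' W_i X)^{-1} X' W_i y, where W_i holds the spatial
    weights of all observations relative to point i."""
    Xw = X * w[:, None]                     # scale each row by its weight
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.1, 50)

# With uniform weights the local fit reduces to global OLS
beta_uniform = local_beta(X, y, np.ones(50))
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_uniform, beta_ols)
```

With distance-decaying weights instead of ones, each location gets its own coefficient vector, which is what the local-coefficient maps later in this post visualise.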

View Code
# 1. Define Dependent (y) and Independent (X) variables
# We reshape y to (-1, 1) to ensure it is a column vector for matrix math
y = data["species_richness"].values.reshape((-1, 1))

# We select our environmental predictors
# We exclude lon/lat from X because they are used in "coords" instead
X = data[["ndvi_mean_z", "ndvi_heterog_z", "elevation"]].values

# 2. Define the spatial coordinates
coords = data.centroid.get_coordinates().values

To specify how large a region the model should use for each local fit, we will compute the optimal bandwidth (how many neighbouring routes should be considered when fitting the model) using Sel_BW from mgwr.

View Code
# Initialize the bandwidth selector
bw_selector = Sel_BW(coords, y, X, fixed=False, multi=False)

# Search for the optimal bandwidth (number of nearest neighbors)
opt_bw = bw_selector.search()

print(f"Optimal bandwidth (neighbors): {opt_bw}")
Optimal bandwidth (neighbors): 80.0
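An adaptive bandwidth of 80 means that each local model effectively weights its 80 nearest routes, with influence decaying smoothly to zero at the 80th neighbour under the bisquare kernel mgwr uses by default. A sketch of that weighting scheme (illustrative, not the mgwr internals; the distances are synthetic):

```python
import numpy as np

def adaptive_bisquare(dist, k=80):
    """Adaptive bisquare kernel: the distance to the k-th nearest
    neighbour sets the local bandwidth b, and weights fall smoothly
    from near 1 (close to the calibration point) to 0 (at and beyond b)."""
    b = np.sort(dist)[k - 1]                 # distance to the k-th neighbour
    return np.where(dist < b, (1.0 - (dist / b) ** 2) ** 2, 0.0)

rng = np.random.default_rng(2)
dist = rng.uniform(0, 100, size=300)         # distances to 300 other routes
w = adaptive_bisquare(dist, k=80)
print(f"routes with non-zero weight: {(w > 0).sum()}")
```

Because the bandwidth is adaptive (a neighbour count rather than a fixed distance), the kernel shrinks in densely surveyed regions and stretches in sparse ones.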

Now we can fit the GWR model using the GWR class from mgwr.

View Code
# Fit the model
gwr_model = GWR(coords, y, X, opt_bw).fit()

# View the global summary of the local models
gwr_model.summary()
===========================================================================
Model type                                                         Gaussian
Number of observations:                                                1442
Number of covariates:                                                     4

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                         188213.517
Log-likelihood:                                                   -5558.494
AIC:                                                              11124.988
AICc:                                                             11127.030
BIC:                                                             177753.813
R2:                                                                   0.314
Adj. R2:                                                              0.313

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                  54.352      0.458    118.587      0.000
X1                                   7.121      0.383     18.574      0.000
X2                                   1.381      0.313      4.420      0.000
X3                                  -0.001      0.001     -2.067      0.039

Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Bandwidth used:                                                      80.000

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                         122138.628
Effective number of parameters (trace(S)):                          160.282
Degree of freedom (n - trace(S)):                                  1281.718
Sigma estimate:                                                       9.762
Log-likelihood:                                                   -5246.719
AIC:                                                              10816.001
AICc:                                                             10856.906
BIC:                                                              11666.566
R2:                                                                   0.555
Adjusted R2:                                                          0.499
Adj. alpha (95%):                                                     0.001
Adj. critical t value (95%):                                          3.234

Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       55.634      9.885     19.849     55.695     85.198
X1                        4.078      5.228    -14.156      4.573     16.293
X2                        1.041      2.064     -6.011      1.245      6.923
X3                        0.002      0.037     -0.083     -0.002      0.200
===========================================================================

The GWR model increased the explained variability (R²) to 55%, and we can now make a visual comparison with the OLS model.

View Code
# 1. Prepare data for plotting
y_obs = data["species_richness"]
ols_pred = ols.predict(data)
gwr_pred = gwr_model.predy.flatten()

# 2. Set the aesthetic style
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(1, 2, figsize=(14, 7), sharey=True, sharex=True)

# 3. OLS Scatterplot
sns.scatterplot(x=y_obs, y=ols_pred, alpha=0.5, ax=ax[0], color="#1f77b4")
ax[0].plot([0, 100], [0, 100], color="red", linestyle="--", lw=2) 
ols_rmse = np.sqrt(mean_squared_error(y_obs, ols_pred))

ax[0].set_title(f"OLS: Observed vs Predicted\n(RMSE: {ols_rmse:.2f})", fontsize=16, pad=15)

# 4. GWR Scatterplot
sns.scatterplot(x=y_obs, y=gwr_pred, alpha=0.5, ax=ax[1], color="#2ca02c")
ax[1].plot([0, 100], [0, 100], color="red", linestyle="--", lw=2)
gwr_rmse = np.sqrt(mean_squared_error(y_obs, gwr_pred))

ax[1].set_title(f"GWR: Observed vs Predicted\n(RMSE: {gwr_rmse:.2f})", fontsize=16, pad=15)

for a in ax:
    a.set_xlim(0, 100)
    a.set_ylim(0, 100)
    a.set_xlabel("Observed Species Richness", fontsize=12)

ax[0].set_ylabel("Predicted Species Richness", fontsize=12)

plt.tight_layout()
plt.show()

The plots above compare predicted and observed values. The more tightly the points cluster around the red 1:1 line, the more accurate the model's prediction. The root mean square error (RMSE) represents the average distance of the points from the red line and was calculated using the mean_squared_error function from the sklearn library. We can clearly see that the GWR model predicts species richness much better than the OLS model, with an average prediction error (RMSE) of about 9 species per route. We can inspect it further using maps.
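For reference, the RMSE reported in the plot titles is simply the square root of the mean squared error, i.e. equivalent to np.sqrt(mean_squared_error(obs, pred)); a self-contained one-liner with made-up numbers:

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error: average prediction error in the units of
    the response (here, species per route)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

print(rmse([40, 50, 60], [43, 46, 60]))  # sqrt((9 + 16 + 0) / 3) ≈ 2.89
```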

View Code
f, axs = plt.subplots(1, 2, figsize=(24, 7))
data.plot(
    gwr_model.predy.flatten(), legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[0]
)
data.plot(
    "species_richness", legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[1]
)
axs[0].set_title("GWR Prediction (Local Model)", fontsize=20)
axs[1].set_title("Actual Species Richness", fontsize=20)

axs[0].set_axis_off()
axs[1].set_axis_off()

As we can see from the maps, the GWR model better captures the variability in the actual data, further highlighting the contrast between East and West; however, it still fails to predict very high or very low values in some areas. To examine these local relationships, we can look at the local β-coefficients.

View Code
# 1. Define significance threshold (alpha adjusted for multiple testing)
sig95 = gwr_model.adj_alpha[1]
critical_t = gwr_model.critical_tval(alpha=sig95)
significant = np.abs(gwr_model.tvalues) > critical_t

# 2. Get the actual number of variables from the model
num_vars = gwr_model.params.shape[1]

# 3. Build var_names
actual_predictor_names = ["ndvi_mean_z", "ndvi_heterog_z", "elevation"]
var_names = ["Intercept"] + actual_predictor_names

# 4. Create the grid
cols = 2
rows = (num_vars + cols - 1) // cols 

fig, axs = plt.subplots(rows, cols, figsize=(12, rows * 5))
axs = axs.flatten()

for i in range(num_vars):
    # Plot the local coefficients
    data.plot(
        gwr_model.params[:, i], 
        cmap="coolwarm", 
        ax=axs[i], 
        markersize=15, # Increased size for better visibility
        legend=True,
        legend_kwds={"shrink": 0.5}
    )
    
    # Significance Mask
    non_sig_mask = ~significant[:, i]
    if non_sig_mask.any():
        data[non_sig_mask].plot(
            color="white", 
            ax=axs[i], 
            alpha=0.7, 
            markersize=15
        )
    
    # Check if we have a name for this index to avoid IndexError
    title_name = var_names[i] if i < len(var_names) else f"Variable {i}"
    axs[i].set_title(f"Local Coef: {title_name}", fontsize=12)
    axs[i].set_axis_off()

# Hide any extra axes
for j in range(i + 1, len(axs)):
    axs[j].set_axis_off()

plt.tight_layout()
plt.show()

The intercept coefficient shows the local baseline species richness that the predictors did not explain. We can see clusters of species-poor routes in the rain shadow of the Rocky Mountains and Great Basin, while biodiversity hotspots are situated mainly in the area of the Great Lakes. These patterns, not captured by the model, could also be explained by a combination of factors such as regional history, the presence of nature protection areas, and human influence (like agricultural intensity).

Relative NDVI significantly drives bird diversity in the arid West, especially in desert habitats, where even a slight increase in vegetation density can lead to a more diverse community of breeding birds. Vegetation heterogeneity is significant only in the Great Basin and has a relatively weak effect. Finally, in Florida, elevation appears to explain species richness, but this is very likely a spurious result: Florida has very low elevation heterogeneity and differs from the other states mainly in its low average elevation, which is undoubtedly not the driver of bird diversity in this area. The generally lower species diversity of the Florida Peninsula is more plausibly driven by factors not included in the model, such as high human population density or climate instability.

5 Conclusion

This work confirms that spatially varying processes drive continental-scale variation in bird species richness. Global Moran’s I and LISA maps reveal an expected East-West dichotomy, with the East having, on average, more biodiversity hotspots than the West. While OLS models capture broad trends, GWR significantly improves prediction accuracy.

Local coefficients show that NDVI drives species richness in the arid West, where even marginal increases in greenness boost diversity. Despite the temporally limited data and the proxy nature of NDVI, the final GWR model adequately revealed the expected SR~NDVI relationship, but it should be treated as exploratory, with results of limited significance.

Note: Coding workflow

I used Gemini to enhance plots and to help with some parts of this assignment, such as the elevation extraction with GEE. I always reviewed the AI output and tried to apply the approaches learned in the course as much as possible.

Course page: Spatial Data Science for Social Geography

5.1 References

Brüggeshemke, Jonas, and Thomas Fartmann. 2025. “Predicting Species Richness and Abundance of Breeding Birds by Remote-Sensing and Field-Survey Data.” Ecological Solutions and Evidence 6 (4): e70170. https://doi.org/10.1002/2688-8319.70170.
Byer, Nathan W., Remington J. Moll, Timothy J. Krynak, et al. 2025. “Breeding Bird Sensitivity to Urban Habitat Quality Is Multi-Scale and Strongly Dependent on Migratory Behavior.” Ecological Applications 35 (1): e3087. https://doi.org/10.1002/eap.3087.
Di Cecco, Grace J., Sara J. Snell Taylor, Ethan P. White, and Allen H. Hurlbert. 2022. “More Individuals or Specialized Niches? Distinguishing Support for Hypotheses Explaining Positive Species–Energy Relationships.” Journal of Biogeography 49 (9): 1629–39. https://doi.org/10.1111/jbi.14459.
Hawkins, Bradford A., Eric E. Porter, and José Alexandre Felizola Diniz-Filho. 2003. “Productivity and History as Predictors of the Latitudinal Diversity Gradient of Terrestrial Birds.” Ecology 84 (6): 1608–23. https://doi.org/10.1890/0012-9658(2003)084[1608:PAHAPO]2.0.CO;2.
Mittelbach, Gary G., Christopher F. Steiner, Samuel M. Scheiner, et al. 2001. “What Is the Observed Relationship Between Species Richness and Productivity?” Ecology 82 (9): 2381–96. https://doi.org/10.1890/0012-9658(2001)082[2381:WITORB]2.0.CO;2.
Nieto, Sebastián, Pedro Flombaum, and Martín F. Garbulsky. 2015. “Can Temporal and Spatial NDVI Predict Regional Bird-Species Richness?” Global Ecology and Conservation 3: 729–35. https://doi.org/10.1016/j.gecco.2015.03.005.
Pigot, Alexander L., Joseph A. Tobias, and Walter Jetz. 2016. “Energetic Constraints on Species Coexistence in Birds.” PLOS Biology 14 (3): e1002407. https://doi.org/10.1371/journal.pbio.1002407.

5.2 Data Sources