Environmental and Spatial Predictors of Bird Diversity across the US
Spatial Analysis of BBS Data with Python🐍
Spatial Data
Birds
Python
Author
Matěj Tvarůžka
Published
February 16, 2026
1 Introduction
One of the fundamental questions in biology is: What is the origin of biodiversity? Why are there so many distinct species at the same site? These questions can be examined through evolutionary and ecological perspectives, which are often connected. The ecological approach is crucial because it links the evolutionary history of coexisting species with the environments in which they occur and examines the factors that allow not just the build-up of local (and consequently regional) species richness but also its maintenance over time. One factor that facilitates species coexistence in birds and therefore drives their species richness (number of species at the site) on the regional scale is environmental primary productivity (Pigot, Tobias, and Jetz 2016).
At broad spatial scales, species richness typically increases with productivity (Hawkins, Porter, and Diniz-Filho 2003). In contrast, at finer spatial scales, the relationship between productivity and species richness is more complex: positive, negative, hump-shaped, and U-shaped relationships have all been documented (Mittelbach et al. 2001). There are several hypotheses for this relationship between productivity and species richness, but we still do not know the ultimate mechanism behind it (Di Cecco et al. 2022). In this assignment, I will examine how vegetation productivity drives species richness in birds across North America.
I will use the Breeding Bird Survey (BBS) dataset, which is one of the most significant and longest-running “citizen science” programmes in the world. Established in 1966, its primary goal was to monitor the status and trends of North American bird populations. This census operates at the landscape scale, with routes about 40 km long and 50 stops where volunteers count birds. As a measure of vegetation primary productivity and environmental heterogeneity, I decided to use the Normalised Difference Vegetation Index (NDVI), which is commonly used as a proxy for primary productivity in both regional (Nieto, Flombaum, and Garbulsky 2015) and local studies (Brüggeshemke and Fartmann 2025).
Note: Research Questions
How does vegetation density/heterogeneity predict bird diversity across the United States?
How do spatial predictors explain the patterns of species richness in North American birds?
Note: Tasks to cover
Prepare and explore the data - clean BBS routes, extract NDVI for route geometry, and get elevation data with Earth Engine.
Evaluate global and local spatial autocorrelation in species richness and model residuals.
Build an OLS model and compare it with a GWR model.
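The autocorrelation task above relies on global Moran's I. The full analysis uses esda for this; as background, the statistic itself can be sketched with plain NumPy. The weight matrix and values below are toy illustrations, not BBS data:

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x and a spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                   # deviations from the mean
    num = (w * np.outer(z, z)).sum()   # weighted cross-products of neighbours
    den = (z ** 2).sum()               # total squared deviation
    return (len(x) / w.sum()) * num / den

# Toy example: four sites on a line, adjacent sites are neighbours
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = [10, 12, 30, 32]                   # similar values sit next to each other
print(round(morans_i(x, w), 3))        # positive value -> spatial clustering
```

A value near +1 indicates strong clustering of similar richness values, near 0 spatial randomness, and negative values a checkerboard-like pattern.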
View Code

import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import rioxarray
import xvec
import folium
import ee
import geemap
import esda
import contextily
from libpysal import graph
import statsmodels.formula.api as smf
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW
from sklearn.metrics import mean_squared_error
2 Data preparation
2.1 Study area
I chose the lower 48 states of the USA as the study area because their route geometries are publicly available. In the basic BBS dataset, the geometry is included only for the starting points of the routes, which are on average about 40 km long, so the single point does not optimally represent their geometry. For basic spatial data preparation, we will use the geopandas library. First, we need to define the boundary of the study area.
View Code

us_states = gpd.read_file("study_area/cb_2018_us_state_500k.shp")
exclude = ["AK", "HI", "PR", "VI", "MP", "GU", "AS"]  # States outside the area of interest
us_states = us_states.query("STUSPS not in @exclude")  # Access the exclude variable from outside the query() string
us_boundary = us_states.dissolve()
2.2 BBS data
Now it is time to load the species richness data, i.e. the number of species detected along each route. I decided to use data from 2018, which I had already prepared in R, computing species richness for each transect. The data have a single geometry column with the coordinates of each route's starting point. I downloaded the data from the official project site.
View Code

bbs = gpd.read_file("bbs/bbs_routes_SR_2018.gpkg", engine="fiona")
bbs = bbs.loc[bbs["CountryNum"] == 840]  # Select only US transects
bbs = bbs.loc[bbs["StateNum"] != 3]  # Select every state except Alaska
bbs["RTENO"] = bbs["RTENO"].astype(str).str[3:]  # Slice the string (skip the first 3 characters)
bbs.head()
Unfortunately, the route geometries are often inaccurate and sometimes consist of several fragments, which can cause problems. Therefore, we first need to remove potentially wrong route geometries. The expected route length is about 40 km, so let us filter out routes that deviate noticeably from this value.
Important
This cleaning will result in the loss of several hundred data points, but we need precise route geometries for data extraction.
View Code

routes = routes.to_crs(epsg=5070)  # First set the CRS to the USA Contiguous Albers Equal Area Conic projection
routes["calc_length_km"] = routes.geometry.length / 1000  # Calculate the route length for the given geometry
routes = routes.loc[routes["RTELENG"] <= 45000]  # Select routes under 45 km by the dataset's own length measure
mask = (routes["calc_length_km"] >= 35) & (routes["calc_length_km"] <= 45)  # Filter out fragments and over-long routes
routes_filtered = routes[mask].copy()
print(f"Original routes count: {len(routes)}")
print(f"Filtered routes count (35-45km): {len(routes_filtered)}")
print(f"Average route length: {routes_filtered['calc_length_km'].mean():.2f} km")
Original routes count: 3062
Filtered routes count (35-45km): 2570
Average route length: 40.14 km
Let us perform the spatial join of these two GeoDataFrames. To combine BBS data with route geometries, we use a spatial join with “inner” because we need matching values in both tables. I could also join the tables on the “RTENO” column, but I want to be sure the geometries match.
View Code

bbs = bbs.to_crs(routes_filtered.crs)  # Make sure it's in the same coordinate system
bbs = bbs.drop(columns="RTENO")
bbs["geometry"] = bbs.buffer(400)  # Add a small buffer (400 m) to the points to account for GPS inaccuracy
joined_data = gpd.sjoin(routes_filtered, bbs, how="inner", predicate="intersects")  # Spatial join of BBS point data with the route geometries
print(f"Routes in subset: {len(routes_filtered)}")
print(f"Routes with data: {len(joined_data['RTENO'].unique())}")
Routes in subset: 2570
Routes with data: 1611
The large drop in the number of routes is due to missing BBS data for some route geometries, which could reflect annual differences in the census - in some years, particular routes are not surveyed because of weather conditions, terrain obstacles, or observer illness. However, the drop is still very large, so there are probably additional errors and inaccuracies in the geometry matching. Now we will do the final preparation of the table for the NDVI and elevation extraction. We need to drop redundant columns, create a buffer around the routes, add the route midpoint, and create lat/lon columns for the external API.
View Code

bbs_routes = joined_data[["RTENO", "species_richness", "geometry"]].copy()  # Add .copy() here to break the link to joined_data
bbs_routes["buff_3000"] = bbs_routes.buffer(3000)  # Create a 3 km buffer - the buffer choice is explained in the NDVI extraction part
bbs_routes["route_midpoint"] = bbs_routes.geometry.interpolate(0.5, normalized=True)  # Calculate the midpoint (50% along the line) as a reference geometry
bbs_routes = bbs_routes.drop(columns="geometry")  # We no longer need the route geometry
temp_gdf = bbs_routes.set_geometry("route_midpoint").to_crs(4326)
bbs_routes["lon"] = temp_gdf.geometry.x
bbs_routes["lat"] = temp_gdf.geometry.y
bbs_routes = bbs_routes.set_geometry("buff_3000")
bbs_routes.head()
   RTENO  species_richness  buff_3000                                           route_midpoint                  lon         lat
0  2072   63                POLYGON ((782767.664 1352804.909, 782613.87 13...  POINT (799058.576 1358112.615)  -87.165585  34.947661
2  2204   54                POLYGON ((911518.247 1350136.581, 911484.644 1...  POINT (920348.359 1362929.663)  -85.828469  34.882272
3  2073   65                POLYGON ((883221.105 1343004.45, 883424.384 13...  POINT (896739.418 1353510.697)  -86.098287  34.820791
4  2071   61                POLYGON ((766029.847 1327436.538, 765484.087 1...  POINT (776778.876 1331128.227)  -87.437364  34.725655
5  2007   63                POLYGON ((869910.569 1340151.781, 869882.814 1...  POINT (879587.76 1341112.337)   -86.300271  34.726327
2.3 NDVI raster
The Normalised Difference Vegetation Index (NDVI) indicates vegetation health within raster image pixels by quantifying the contrast between the near-infrared (NIR) and red light that plants reflect. Healthy vegetation reflects a higher proportion of NIR and a lower proportion of red light than stressed vegetation or non-living surfaces with similar visible colours (for example, artificial turf fields). This contrast makes NDVI a practical tool for evaluating vegetation health in raster images relative to the surroundings.
NDVI is calculated for each pixel with the following formula:
\(NDVI = \frac{NIR - Red}{NIR + Red}\)
This formula generates a value between -1 and +1. Low reflectance in the red channel and high reflectance in the NIR channel will yield a high NDVI value (healthy vegetation), while the inverse will result in a low NDVI value (unhealthy vegetation). Negative values typically represent non-vegetation, such as water or rock.
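As a minimal illustration of the formula (with made-up reflectance values, not real imagery), NDVI can be computed per pixel with NumPy:

```python
import numpy as np

def ndvi(nir, red):
    """Per-pixel NDVI from NIR and red reflectance arrays."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)

# Toy 2x2 reflectances: dense vegetation, sparse vegetation, bare soil, water
nir = np.array([[0.50, 0.30], [0.20, 0.02]])
red = np.array([[0.08, 0.15], [0.18, 0.05]])
print(ndvi(nir, red).round(2))
```

The dense-vegetation pixel yields a high NDVI, the bare-soil pixel a value near zero, and the water pixel a negative value, matching the interpretation above.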
The NDVI raster for the whole area would be very large and computationally demanding. Therefore, I decided to work with a compressed 8-bit raster (values from 0 to 255) from NASA Earth Observations (NEO), which represents a 1-month average for June. The values in this file have been scaled and resampled for visualisation in NEO and are not suitable for rigorous scientific analysis; however, they can be used for basic assessments and for identifying general trends, which is exactly the goal of this study. We only need to resolve the differences in value scaling. We will extract the mean NDVI values within a 3 km buffer; this distance has been shown to preserve higher predictive power for covariates in studies of breeding birds (Byer et al. 2025). For manipulating raster data, we will use the libraries rioxarray (for reading raster files), xvec (for data extraction and zonal statistics computation), and folium for map visualisations.
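The value scaling matters less than it might seem: because the predictors are later standardised to Z-scores, any linear rescaling of the 8-bit values leaves the standardised predictors unchanged. The sketch below demonstrates this with made-up 8-bit values; the particular linear map used (0-255 stretched to -1..1) is an illustrative assumption, not NEO's documented scaling:

```python
import numpy as np

def zscore(v):
    """Standardise an array to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

raw = np.array([120.0, 180.0, 200.0, 240.0])  # made-up 8-bit NDVI values
# Hypothetical linear rescale to an NDVI-like range; any map a*x + b
# with a > 0 gives identical Z-scores
rescaled = raw / 255.0 * 2.0 - 1.0

print(np.allclose(zscore(raw), zscore(rescaled)))  # True
```

This is why standardising the 8-bit means and standard deviations, rather than recovering absolute NDVI, is sufficient for the trend analysis here.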
View Code

ndvi = rioxarray.open_rasterio("MOD_NDVI_M_2018-06-01_rgb_3600x1800.TIFF")
ndvi_us = ndvi.rio.clip(us_boundary.to_crs(ndvi.rio.crs).geometry)  # Match the CRS and clip to the study area
ndvi_us = ndvi_us.drop_vars("band").squeeze()  # Get rid of the redundant dimension
ndvi_us = ndvi_us.where(ndvi_us > 0)  # Filter null values
print(f"Raster CRS: {ndvi_us.rio.crs}")  # Check the CRS to see if it's in degrees (4326) or metres
Raster CRS: EPSG:4326
View Code
_ = ndvi_us.plot()
We can see from the map that NDVI is much higher on the East Coast than in the western US. Now we can extract values from the raster for the route buffers we created earlier. We will extract: 1) the mean NDVI for each buffer, and 2) the standard deviation of NDVI for each buffer, which will help estimate the heterogeneity of the vegetation.
View Code

ndvi_us = ndvi_us.rio.reproject(bbs_routes.crs)  # Reproject to EPSG:5070
ndvi_extraction = bbs_routes.to_crs(ndvi_us.rio.crs)
zonal_stats = ndvi_us.drop_vars("spatial_ref").xvec.zonal_stats(  # Extract NDVI values for the buffers
    geometry=ndvi_extraction.buff_3000,
    x_coords="x",
    y_coords="y",
    stats=["mean", "std"],
)
zonal_stats.head()
Let us check if there are some missing values in the data.
View Code

ndvi = zonal_stats.xvec.to_geodataframe(name="NDVI")  # Transform the zonal statistics into a GeoDataFrame
print(ndvi.isna().sum())  # Returns the count of NaNs for every column in the dataframe
geometry 0
index 0
NDVI 66
dtype: int64
View Code

ndvi_clean = ndvi.dropna(subset=["NDVI"])  # Drop rows where the NDVI value is missing
ndvi_stats = ndvi_clean.reset_index().pivot(
    index="index", columns="zonal_statistics", values="NDVI"
)  # This creates a table with index as rows and mean/std as columns
ndvi_stats.columns = ["ndvi_" + col for col in ndvi_stats.columns]  # Add the "ndvi_" prefix to the new columns
geometries = ndvi_clean[["index", "geometry"]].drop_duplicates("index").set_index("index")  # Get the unique geometry for each index
ndvi_data = geometries.join(ndvi_stats).reset_index()  # Join them together
ndvi_data.head()
   index  geometry                                            ndvi_mean   ndvi_std
0  0      POLYGON ((782767.664 1352804.909, 782613.87 13...  231.666672  7.586538
1  2      POLYGON ((911518.247 1350136.581, 911484.644 1...  243.000000  7.874008
2  3      POLYGON ((883221.105 1343004.45, 883424.384 13...  252.500000  2.598076
3  4      POLYGON ((766029.847 1327436.538, 765484.087 1...  201.500000  2.500000
4  5      POLYGON ((869910.569 1340151.781, 869882.814 1...  235.250000  20.728905
Although the NDVI data were retrieved in 8-bit format (0-255), we can standardise them to Z-scores prior to modelling. This centres the data and removes the original units, so the model coefficients are expressed on a comparable, unitless scale rather than in the arbitrary units of the 8-bit imagery.
View Code

# Standardise the mean (relative productivity)
ndvi_data["ndvi_mean_z"] = (ndvi_data["ndvi_mean"] - ndvi_data["ndvi_mean"].mean()) / ndvi_data["ndvi_mean"].std()

# Standardise the std (relative heterogeneity)
ndvi_data["ndvi_heterog_z"] = (ndvi_data["ndvi_std"] - ndvi_data["ndvi_std"].mean()) / ndvi_data["ndvi_std"].std()
Now that we have standardised our data, we can perform the spatial join with the bbs_routes GeoDataFrame.
View Code

ndvi_to_join = ndvi_data[["ndvi_mean_z", "ndvi_heterog_z", "geometry"]]
buffers = bbs_routes.set_geometry("buff_3000")
bbs_ndvi = gpd.sjoin(
    buffers, ndvi_to_join,
    how="inner",  # Keep only rows that exist in both sets
    predicate="intersects"  # The polygons must overlap or coincide
)
bbs_ndvi = bbs_ndvi.drop(columns="index_right")  # Drop the index_right helper column
data = bbs_ndvi.drop_duplicates(subset=["RTENO"])  # Remove duplicate matches per route
data.head()
   RTENO  species_richness  buff_3000                                           route_midpoint                  lon         lat        ndvi_mean_z  ndvi_heterog_z
0  2072   63                POLYGON ((782767.664 1352804.909, 782613.87 13...  POINT (799058.576 1358112.615)  -87.165585  34.947661  0.782085     -0.136629
2  2204   54                POLYGON ((911518.247 1350136.581, 911484.644 1...  POINT (920348.359 1362929.663)  -85.828469  34.882272  0.997218     -0.101972
3  2073   65                POLYGON ((883221.105 1343004.45, 883424.384 13...  POINT (896739.418 1353510.697)  -86.098287  34.820791  0.850105     1.447823
4  2071   61                POLYGON ((766029.847 1327436.538, 765484.087 1...  POINT (776778.876 1331128.227)  -87.437364  34.725655  0.209451     -0.749866
5  2007   63                POLYGON ((869910.569 1340151.781, 869882.814 1...  POINT (879587.76 1341112.337)   -86.300271  34.726327  0.850105     1.447823
Now we can take a look at the map of relative NDVI and see if it corresponds with our expectations.
View Code

m = data.explore(
    column="ndvi_mean_z",
    cmap="RdYlGn",
    marker_kwds={"radius": 5},
    vmin=-2.5, vmax=2.5,
    legend=True,
    tooltip=["ndvi_mean_z"],
    tiles="CartoDB positron",
    popup=True  # Click to see all data for that route
)
m
[Interactive map of relative NDVI (ndvi_mean_z) for the BBS routes]