Project Goal: Explore whether referee travel distance influences officiating behavior during NCAA Division I Men's Basketball games.
Tech Stack: Python (requests, BeautifulSoup, geopy), R (tidyverse, ggplot2, plotly), R Shiny
In college basketball, high-stakes games demand fairness. But what if unseen factors influence outcomes? This project began with a simple question: Does referee travel distance affect officiating behavior? Before answering that, I had to build the data infrastructure from the ground up.
Unlike sites that expose a modern API, stats.ncaa.org serves static HTML with no structured access points. Every game, box score, and referee assignment is embedded in hard-to-navigate pages. The breakthrough came from discovering that each game has a unique ID buried in its URL. That ID became the anchor for scraping a full season's worth of data.
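To illustrate the idea of anchoring on a game ID, here is a minimal sketch that pulls numeric IDs out of the links on a page. The URL shape (`/contests/<id>/box_score`), the sample HTML, and the names `GAME_ID_PATTERN`, `GameLinkParser`, and `extract_game_ids` are all assumptions made for illustration, not the site's documented structure or the project's actual code. The sketch uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies.

```python
import re
from html.parser import HTMLParser

# Assumption: the game ID is the numeric path segment after "/contests/".
GAME_ID_PATTERN = re.compile(r"/contests/(\d+)")


class GameLinkParser(HTMLParser):
    """Collects unique game IDs from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.game_ids = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = GAME_ID_PATTERN.search(href)
        if match and match.group(1) not in self._seen:
            self._seen.add(match.group(1))
            self.game_ids.append(match.group(1))


def extract_game_ids(html):
    """Return the game IDs found in a page of HTML, without duplicates."""
    parser = GameLinkParser()
    parser.feed(html)
    return parser.game_ids


# Made-up markup standing in for one day's scoreboard page.
sample_html = """
<table>
  <tr><td><a href="/contests/1001/box_score">Box Score</a></td></tr>
  <tr><td><a href="/contests/1002/box_score">Box Score</a></td></tr>
  <tr><td><a href="/contests/1001/play_by_play">Play by Play</a></td></tr>
  <tr><td><a href="/teams/55">Team page</a></td></tr>
</table>
"""

print(extract_game_ids(sample_html))
```

In a real scraper the HTML would come from an HTTP request for each date, and each collected ID would then be used to request that game's detail pages.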
Using custom Python scripts, I automated the collection of game IDs by date and extracted structured details for each matchup, including team stats, locations, and referee crews. The final dataset covers 5,922 games from the 2024–25 season, with 34 structured columns and more than 770 referees.
To measure travel, I geocoded venue names into latitude and longitude coordinates using Nominatim (OpenStreetMap). I then calculated straight-line distances between the venues of each official's successive assignments across the season. This reconstructs a proxy travel itinerary for every referee.
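The distance step can be sketched as follows. This assumes "straight-line" means great-circle distance computed with the haversine formula, and that a referee's itinerary is the sequence of assignments ordered by date; both are my reading, not confirmed details of the project. The function names, the record layout, and the coordinates are hypothetical placeholders. With geopy installed, `geopy.distance.great_circle` would do the same job as `haversine_km`.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def itinerary_legs(assignments):
    """Order one referee's assignments by date and measure each leg.

    Each assignment is a dict with "date" (ISO string), "venue",
    "lat", and "lon". This layout is an assumption for the sketch.
    """
    ordered = sorted(assignments, key=lambda a: a["date"])
    legs = []
    for prev, curr in zip(ordered, ordered[1:]):
        legs.append({
            "from": prev["venue"],
            "to": curr["venue"],
            "km": haversine_km(prev["lat"], prev["lon"],
                               curr["lat"], curr["lon"]),
        })
    return legs


# Placeholder assignments with made-up coordinates, deliberately unsorted.
assignments = [
    {"date": "2025-01-11", "venue": "Venue C", "lat": 0.0, "lon": 2.0},
    {"date": "2025-01-04", "venue": "Venue A", "lat": 0.0, "lon": 0.0},
    {"date": "2025-01-08", "venue": "Venue B", "lat": 0.0, "lon": 1.0},
]

legs = itinerary_legs(assignments)
for leg in legs:
    print(f'{leg["from"]} -> {leg["to"]}: {leg["km"]:.1f} km')
print(f'Total: {sum(leg["km"] for leg in legs):.1f} km')
```

Note that this measures venue-to-venue distance only. It does not account for officials returning home between games, which is one reason the result is a proxy rather than a true travel log.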
To make the data actionable, I developed an R Shiny dashboard to visualize referee workload and travel behavior. It supports interactive exploration of referee patterns and flags outliers in assignment strategy.
A custom Python class was developed to: