Predicting Yards Passing for NFL Quarterbacks Using Machine Learning: Part 1

By Nolan Johnson
Generative AI Intern, BSC Analytics

01 Oct 2024

Part 1: Data Collection and Understanding

Introduction

Predicting NFL quarterback performance is a challenging and exciting task, especially for fans who enjoy diving deep into stats and numbers. Machine learning has opened up new ways to make these predictions more accurate and insightful. In this series, we'll explore how to predict one of the most critical quarterback stats: total passing yards. By the end of the series, you'll understand how to use various combinations of NFL quarterback statistics to forecast passing yards using machine learning regression models.

The goal of this series is not just to build a model but to explore each critical step, from data collection to model evaluation, to understand the predictive power of various QB stats. In this first article, we'll focus on laying the foundation: defining the question, finding reliable data, understanding its structure, and ensuring it's ready for model training.

Defining the Question

To begin any predictive analysis, we need a well-defined question. As NFL fans, we know that total passing yards is one of the most important stats when evaluating a quarterback's performance. However, it's crucial to determine which other statistics are most predictive of passing yards. Total passing yards is, by definition, passing attempts multiplied by yards per attempt, so that pair predicts it perfectly without telling us anything new, and we want to dig deeper. Are there other statistics that have a strong predictive relationship with total passing yards? And which combination of stats offers the most accurate predictions?

This last question will guide our entire analysis, and, as we progress through the series, we'll refine our models to answer it as effectively as possible.
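The identity between attempts, yards per attempt, and total yards mentioned above can be checked in a few lines. This is a minimal sketch with invented toy numbers; the column names are assumptions, not the dataset's actual labels.

```python
import pandas as pd

# Toy numbers invented for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "attempts": [500, 320, 610],
    "yards": [3500, 2240, 4575],
})

# Yards per attempt is defined as yards divided by attempts...
toy["yards_per_attempt"] = toy["yards"] / toy["attempts"]

# ...so multiplying it back by attempts reproduces the target exactly.
toy["reconstructed_yards"] = toy["attempts"] * toy["yards_per_attempt"]

print(toy)
```

Because yards per attempt is computed from the target itself, a model given both columns would learn nothing beyond this arithmetic, which is why the more interesting question is about the other statistics.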

Finding the Data

The foundation of any successful machine learning project is high-quality data. Without reliable data, even the best models will produce poor results. In our case, we need historical NFL quarterback stats that include a wide range of metrics, from passing attempts and completions to interceptions and touchdowns.

There are many places to find NFL data, but I've found Kaggle to be an excellent resource for data science projects. Kaggle offers a variety of datasets, often with detailed explanations, making it easier to understand the context of the data. For this project, I chose the dataset "NFL QB Stats 1970-2022," available here. This dataset contains quarterback statistics from over five decades, including the key metrics we need to answer our question.

Understanding the Data

Once we have the data, the next step is understanding its structure. Knowing how the data is distributed helps us choose the right model and ensures we clean the data appropriately. Using Python's Pandas library, we can begin by exploring basic descriptive statistics, such as the mean, median, and standard deviation for each variable.
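As a rough sketch of that first look, the snippet below computes descriptive statistics with pandas. The rows are hypothetical stand-ins so the example runs on its own, and the column names and the CSV file name in the comment are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows; a real run would load the Kaggle CSV instead,
# e.g. qb = pd.read_csv("qb_stats.csv")  (file name is an assumption).
qb = pd.DataFrame({
    "attempts": [12, 310, 455, 520, 601],
    "completions": [5, 180, 290, 340, 410],
    "yards": [48, 2100, 3300, 3900, 4700],
})

summary = qb.describe()   # count, mean, std, min, quartiles, max per column
medians = qb.median()     # median of each numeric column

print(summary.loc[["mean", "std"]])
print(medians)
```

Comparing the mean and the median of a column is a quick hint about skew: when a few very small or very large values are present, the two can differ noticeably.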

The two main questions to address when analyzing any dataset are:

How are the data distributed?
What data needs to be cleaned?

To answer the first question, we'll use visualizations. Tools like Matplotlib's Pyplot and Seaborn are great for this. I started by plotting histograms to visualize the distribution of key statistics like passing yards, completions, and attempts. This gave me a clearer picture of how QB stats are distributed. For example, I found that most quarterbacks' attempts are clustered around the median, with a few outliers where QBs had either very few or an exceptionally high number of attempts.

Visualizing the Data

Visualizing the data allows us to spot patterns and anomalies that may not be obvious from raw numbers alone. For instance, here's a sample histogram of the passing yards distribution from the dataset, including the code I used in Pyplot to visualize our data followed by the resulting output histogram:
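A histogram like this can be produced with a few lines of Pyplot. The sketch below is not the original listing; it assumes a column named "yards", uses placeholder values so it runs standalone, and wraps the plotting in a hypothetical helper, plot_yards_histogram.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_yards_histogram(df, column="yards", bins=5):
    """Draw a histogram of one column and return the Matplotlib Axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(df[column], bins=bins, edgecolor="black")
    ax.set_xlabel("Passing yards")
    ax.set_ylabel("Count")
    ax.set_title("Distribution of passing yards")
    return ax

# Placeholder values so the sketch runs on its own; not real statistics.
qb = pd.DataFrame({"yards": [150, 900, 2100, 2800, 3100, 3300, 3600, 4100, 5000]})
ax = plot_yards_histogram(qb)
```

In a script, calling plt.show() afterwards displays the figure; in a notebook it usually renders automatically. With the full dataset a larger bins value would likely give a smoother picture.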

From the plot, we can see that the distribution looks roughly normal but with some outliers. These outliers indicate we may need to filter or clean the data before building our model. But first, we need to decide which features or statistics are most relevant to our analysis.
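One common way to flag such outliers numerically is the interquartile range (IQR) rule. This is offered only as an illustration, not as the method used in this series, and the values below are placeholders.

```python
import pandas as pd

# Placeholder values for illustration; not real statistics.
yards = pd.Series([40, 2900, 3100, 3200, 3300, 3400, 3500, 3700, 5400], name="yards")

# Quartiles and the interquartile range.
q1, q3 = yards.quantile(0.25), yards.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = yards[(yards < lower) | (yards > upper)]
print(outliers)
```

The 1.5 multiplier is a convention rather than a rule, and a flagged value still deserves a look before it is dropped.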

Key Statistics for Analysis

We know that passing yards are closely related to several other quarterback statistics, such as attempts, completions, and interceptions. But are these the most important? In our analysis, we will focus on the combination of statistics most likely to predict total passing yards effectively.

Here are a few key stats we'll be using:

Attempts: The total number of times the QB threw the ball.
Completions: How many of those attempts were successful.
Interceptions: How many times the opposing defense intercepted the ball.
Touchdowns: Scoring plays via a pass.

These statistics, along with others, will be part of our exploratory analysis to determine which variables have the strongest correlation with total passing yards.
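A simple starting point for that exploratory step is to rank each candidate feature by its Pearson correlation with passing yards. The sketch below uses placeholder rows, and the column names are assumptions about how the dataset labels these stats.

```python
import pandas as pd

# Placeholder rows for illustration; column names are assumptions.
qb = pd.DataFrame({
    "attempts": [150, 310, 455, 520, 601],
    "completions": [80, 180, 290, 340, 410],
    "interceptions": [9, 7, 14, 10, 12],
    "touchdowns": [4, 12, 22, 27, 35],
    "yards": [900, 2100, 3300, 3900, 4700],
})

# Pearson correlation of every feature with the target, strongest first
# (sorted by absolute value so strong negative relationships also surface).
corr_with_yards = (
    qb.corr()["yards"]
    .drop("yards")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_yards)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a screening tool rather than a final answer about which combination of stats predicts best.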

Data Cleaning

Once we understand the data, the next step is cleaning it. Cleaning data ensures that we're working with accurate, relevant information, which directly impacts the performance of our machine learning model. In this case, I noticed that some quarterbacks played very few games or attempted very few passes. Including these players would skew the results, so I decided to set a cutoff point: only quarterbacks with at least 100 completions would be included in the analysis.

Cleaning the data is a critical step because the quality of our model depends on it. Removing irrelevant or noisy data helps the model focus on the most meaningful patterns. By filtering out these low-completion QBs, we ensure that the remaining data is more representative of quarterbacks who had a significant impact on the game.
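The 100-completion cutoff described above can be applied with a boolean mask in pandas. This is a minimal sketch: the player labels, column names, and numbers are hypothetical placeholders.

```python
import pandas as pd

MIN_COMPLETIONS = 100  # cutoff described in the text

# Placeholder rows; player labels and column names are hypothetical.
qb = pd.DataFrame({
    "player": ["QB A", "QB B", "QB C", "QB D"],
    "completions": [12, 100, 340, 99],
    "yards": [130, 1150, 3900, 1020],
})

# Keep only rows at or above the cutoff, then renumber the index.
filtered = qb[qb["completions"] >= MIN_COMPLETIONS].reset_index(drop=True)
print(f"Kept {len(filtered)} of {len(qb)} rows")
```

Keeping the threshold in a named constant makes it easy to rerun the analysis with a different cutoff and see how sensitive the results are to it.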

Conclusion

In this first part of our series, we've laid the groundwork by defining the question, finding a reliable dataset, and cleaning the data to prepare it for analysis. Data collection and understanding are crucial steps because they form the foundation for building an accurate and effective machine learning model.

In the next part, we'll dive into model selection and training, where we'll explore which machine learning models are best suited to predict passing yards based on our cleaned data. Stay tuned as we continue our journey into the world of NFL stats and machine learning!
