Late Again

An attempt to predict the delay rate of Newark flights.

Eric Kleen
4 min read · Dec 3, 2020
Photo by Erik Odiin on Unsplash

Introduction

We’ve all been there…after waiting for two hours at the crowded gate we hear the announcement that our flight will be delayed. If you are anything like me, you have come to expect this to happen almost 100% of the time. Because I travel mostly from one of the busiest airports in the nation, Newark Liberty Airport (EWR), I wanted to see if I could prove my hypothesis. The story below is a description of the journey I took to determine how often EWR flights are delayed and attempt to predict the possibility of delay for future flights.

The data for this article is taken from the US Department of Transportation website and can be found along with my Jupyter Notebook on my GitHub repository here. The data includes all 2019 flights arriving at or departing from EWR.

For brevity, this story uses the term “delayed” to refer to flights that were delayed by at least 30 minutes or cancelled. I felt that a delay under 30 minutes, while annoying, is less likely to seriously disrupt follow-on travel plans.
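As a rough illustration of that definition, the labeling rule could be sketched in Python as below. The function name, the record layout, and the sample values are assumptions made for this example and are not taken from the notebook.

```python
from typing import Optional

DELAY_THRESHOLD_MINUTES = 30  # threshold stated in the article


def is_delayed(dep_delay_minutes: Optional[float], cancelled: bool) -> bool:
    """Return True if a flight counts as "delayed": late by 30+ minutes or cancelled."""
    if cancelled:
        return True
    if dep_delay_minutes is None:  # assumption: a missing delay value means not delayed
        return False
    return dep_delay_minutes >= DELAY_THRESHOLD_MINUTES


# Hypothetical records: (departure delay in minutes, cancelled flag)
flights = [(5.0, False), (45.0, False), (None, True), (-3.0, False), (30.0, False)]

labels = [is_delayed(delay, cancelled) for delay, cancelled in flights]
delay_rate = sum(labels) / len(labels)  # share of flights labeled as delayed
print(labels, delay_rate)
```

The same rate calculation, applied to every departing flight, would answer Q1 below.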

To build a prediction model, I first needed to understand how different attributes of a scheduled flight affect the potential for delay, so I began analyzing the data with these questions in mind.

Q1: What is the overall rate of delayed/cancelled flights departing EWR?
Q2: What is the rate of delayed/cancelled flights departing EWR for each airline and which airlines have the most influence on the overall delay rate?
Q3: Is there monthly/weekly/daily seasonality?

What is the overall rate of delayed/cancelled flights departing EWR?

Figure 1: 2019 Delay Rate of Departing Flights

Although it feels as if my flights are almost always delayed, I expected the true rate to be under 10%, so I was a bit shocked to find that almost a fourth of the flights leaving EWR are in fact delayed.

What is the rate of delayed/cancelled flights departing EWR for each airline and which airlines have the most influence on the overall delay rate?

Figure 2: Delay Rate by Airline
Figure 3: Total Flights by Airline

While Frontier Airlines has the worst delay rate, it also has the fewest total flights. It’s pretty easy to see that United Airlines makes up the majority of flights, and therefore the overall delay rate of all EWR flights depends primarily on United’s performance. Most importantly, it’s clear that the airline influences the probability that a flight will be delayed.
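The reason a high-volume airline dominates the overall figure is that the overall delay rate is a flight-weighted average of the per-airline rates. Here is a small sketch of that arithmetic; the airline names and counts are made up for illustration and are not the EWR numbers.

```python
# Hypothetical counts, NOT the article's data: airline -> (total flights, delayed flights)
counts = {
    "Airline A": (7000, 1610),  # large carrier, moderate delay rate
    "Airline B": (2000, 400),
    "Airline C": (1000, 350),   # small carrier, worst delay rate
}

total_flights = sum(total for total, _ in counts.values())
total_delayed = sum(delayed for _, delayed in counts.values())
overall_rate = total_delayed / total_flights

contributions = {}
for airline, (total, delayed) in counts.items():
    rate = delayed / total         # the airline's own delay rate
    share = total / total_flights  # the airline's share of all flights
    contributions[airline] = rate * share  # its term in the overall rate
    print(f"{airline}: rate={rate:.1%}  share={share:.1%}  "
          f"contribution={contributions[airline]:.1%}")

# The contributions add up to the overall delay rate.
print(f"overall: {overall_rate:.1%}")
```

In this toy example the airline with the worst rate contributes the least to the overall figure, simply because it flies the fewest flights.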

Is there monthly/weekly/daily seasonality?

Figure 4: Delay Rate by Month
Figure 5: Delay Rate by Weekday
Figure 6: Delay Rate by Time of Day

Again, based on the results of this analysis, each of these factors appears to have an effect on the likelihood of delay.
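For readers who want to reproduce this kind of breakdown, here is one possible sketch of deriving month, weekday, and time-of-day from a scheduled departure time and computing a delay rate per group. The bucket boundaries, helper names, and sample records are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket; the boundaries are an assumption."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def seasonal_features(scheduled_departure: datetime) -> dict:
    """Extract the three seasonality features from a scheduled departure time."""
    return {
        "month": scheduled_departure.month,
        "weekday": WEEKDAYS[scheduled_departure.weekday()],
        "time_of_day": time_of_day(scheduled_departure.hour),
    }


def delay_rate_by(records, feature: str) -> dict:
    """Group (scheduled_departure, was_delayed) records by a feature and return delay rates."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for scheduled_departure, was_delayed in records:
        key = seasonal_features(scheduled_departure)[feature]
        totals[key] += 1
        delayed[key] += int(was_delayed)
    return {key: delayed[key] / totals[key] for key in totals}


# Hypothetical records, not the article's data
records = [
    (datetime(2019, 6, 14, 18, 30), True),
    (datetime(2019, 6, 15, 7, 0), False),
    (datetime(2019, 11, 4, 19, 15), False),
    (datetime(2019, 11, 5, 8, 45), False),
]
print(delay_rate_by(records, "month"))
print(delay_rate_by(records, "time_of_day"))
```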

Deploying the Model to Predict the Future

Based on what I learned in the previous section, I built a binary classifier using the GradientBoostingClassifier from Python’s scikit-learn library, because I’ve had good luck with this algorithm in the past. I used only Airline, Month, Weekday, and Time-of-Day as the feature set and trained the model on 80% of the total data. I then entered the details of my own upcoming flight to get a prediction of “On Time” or “Delayed”. I was quite relieved when the model returned a prediction of “On Time” with 79% confidence. The true test came a week later when I actually took the trip and, sure enough…on time!
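A minimal sketch of this kind of setup is shown below. It is not the notebook's code: the data is synthetic, the one-hot encoding step is an assumption about how the categorical features might be handled, and all category values are made up for the example.

```python
import random

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = random.Random(0)
AIRLINES = ["UA", "DL", "AA", "F9"]
MONTHS = ["Jan", "Jun", "Nov"]
WEEKDAYS = ["Mon", "Fri", "Sun"]
TIMES = ["morning", "afternoon", "evening"]

# Synthetic stand-in data: each row is [airline, month, weekday, time_of_day].
X, y = [], []
for _ in range(2000):
    row = [rng.choice(AIRLINES), rng.choice(MONTHS), rng.choice(WEEKDAYS), rng.choice(TIMES)]
    # Made-up delay probabilities, chosen only so the model has a pattern to learn.
    p_delay = 0.15 + (0.15 if row[3] == "evening" else 0.0) + (0.10 if row[1] == "Jun" else 0.0)
    X.append(row)
    y.append(1 if rng.random() < p_delay else 0)  # 1 = delayed, 0 = on time

# Train on 80% of the rows and hold out the remaining 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# One-hot encode the categorical features, then fit the gradient boosting classifier.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    GradientBoostingClassifier(random_state=0),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Predict a single hypothetical upcoming flight; columns follow the label order [0, 1].
my_flight = [["UA", "Jun", "Fri", "evening"]]
p_on_time, p_delayed = model.predict_proba(my_flight)[0]
print(f"P(on time)={p_on_time:.2f}  P(delayed)={p_delayed:.2f}")
```

The probability returned by `predict_proba` is the sort of number that could be read as a "confidence" for a single flight, though a single on-time trip says little about how well the model performs overall.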

Conclusions

Throughout this project I explored the 2019 EWR flight data and learned many interesting facts, for example that November has a lower delay rate than June. However, the model performs only slightly better than simply using the average delay rate of all flights. I believe additional features are needed to improve the model; factors such as the destination or the length of the previous leg for the same tail number could be important.
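One way to make the "only slightly better than the average" comparison concrete is to measure the model against a baseline that always predicts the more common outcome. The sketch below assumes accuracy as the metric, which the article does not specify, and uses hypothetical labels and predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the more common class (labels are 0/1)."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


# Hypothetical held-out labels (1 = delayed) with a 25% delay rate
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
# Hypothetical model predictions for the same flights
y_model = [1, 0, 0, 0, 0, 0, 0, 0]

baseline = majority_baseline_accuracy(y_true)  # always predict "on time"
model_accuracy = accuracy(y_true, y_model)
print(f"baseline={baseline:.3f}  model={model_accuracy:.3f}")
```

With roughly a quarter of flights delayed, always answering "on time" is already right about three times out of four, so a useful model has to clear that bar by a meaningful margin.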
