60 down, 102 left unplayed: A look at the 2020 MLB season and the batting statistics within.
Updated: Dec 27, 2020
It starts with pitchers and catchers reporting to spring training and ends with the last out of the World Series, a schedule with which most baseball fans are familiar. From February to October, games are played, standings shift, data is gathered, and a champion is crowned. In 2020, that title went to the Los Angeles Dodgers, who won their seventh World Series, and their first since 1988. True to schedule, the final out of the season was recorded on October 27th. However, what came before was anything but expected.
As is well known by now, the 2020 MLB season was shortened to 60 games due to the COVID-19 pandemic. This unprecedented move provided an opportunity to view a season's worth of statistics through a different lens. Which batting statistics were most affected by the shortened season, and which were seemingly unchanged?
The statistics I examined are listed below. For my analysis, I focused only on the 2018-2020 seasons, to help reduce confounding factors resulting from changes in gameplay or strategy over time.
Strikeout Percentage (K%)
Walk Percentage (BB%)
Batting Average (BA)
Slugging Percentage (SLG)
On-Base Percentage (OBP)
On-Base Plus Slugging (OPS)
Isolated Power (ISO)
Weighted On-Base Average (wOBA)
Expected Batting Average (xBA)
Expected Slugging (xSLG)
Expected On-Base Percentage (xOBP)
Expected Isolated Power (xISO)
Once I had gathered my data, taken directly from Baseball Reference, I used the pandas library in Python to create tables for the players who played in 2020 and in 2018 or 2019. For players who appeared in both 2018 and 2019, the 2018/2019 table holds the average of each statistic across the two years.
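A minimal sketch of how such tables could be built with pandas. The player names, column names, and values here are made up for illustration, and the exact steps used in the original analysis may differ.

```python
import pandas as pd

# Hypothetical season-level data: one row per player per season.
batting = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "season": [2018, 2019, 2020, 2019, 2020, 2018, 2019, 2020],
    "BA":  [0.250, 0.270, 0.240, 0.300, 0.280, 0.260, 0.255, 0.310],
    "SLG": [0.400, 0.440, 0.390, 0.500, 0.470, 0.410, 0.420, 0.520],
})
stat_columns = ["BA", "SLG"]

# 2018/2019 baseline: the mean over whichever of those seasons a player
# appeared in (a single season is kept as-is, two seasons are averaged).
prior = (
    batting[batting["season"].isin([2018, 2019])]
    .groupby("player")[stat_columns]
    .mean()
)

# 2020 table, indexed by player so it lines up with the baseline.
season_2020 = batting[batting["season"] == 2020].set_index("player")[stat_columns]

# Keep only players present in both tables.
common = prior.index.intersection(season_2020.index)
prior = prior.loc[common]
season_2020 = season_2020.loc[common]

print(prior)
print(season_2020)
```

In this toy data, player C (no 2020 season) and player D (no 2018 or 2019 season) drop out, leaving two tables with identical indexes that can be compared row by row.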
Using the matplotlib library in Python, I put together scatterplots with the 2020 season on the X-axis, and the 2018/2019 seasons on the Y-axis.
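As a rough sketch, one such scatter plot could be drawn as follows. The two small tables and their values are placeholders rather than the real data, and the non-interactive backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder tables sharing the same player index (values are invented).
players = ["A", "B", "C", "D", "E"]
prior = pd.DataFrame({"BA": [0.250, 0.270, 0.300, 0.260, 0.280]}, index=players)
season_2020 = pd.DataFrame({"BA": [0.262, 0.241, 0.285, 0.270, 0.255]}, index=players)

fig, ax = plt.subplots()
# 2020 on the X-axis, 2018/2019 on the Y-axis, one point per player.
ax.scatter(season_2020["BA"], prior["BA"])
ax.set_xlabel("2020 BA")
ax.set_ylabel("2018/2019 BA")
ax.set_title("Batting average: 2020 vs. 2018/2019")
```

Repeating the same call for each statistic column gives one plot per statistic.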
While there may not be an immediately noticeable strong correlation, I think these graphs paint a very interesting picture, namely of how much individual player performance can vary from year to year. Sure, there are players like Oakland's Khris Davis, who once hit .247 in four consecutive seasons, but that's certainly not to be expected. Indeed, we would expect to see similar scatter plots even if the 2020 season had been a full 162 games.
To dive deeper into the correlations between the years, I used the pandas corrwith method to output a table correlating each statistic in one table with the same statistic in the other. For example, the correlation for batting average between 2018/2019 and 2020 is .260, suggesting a weak positive correlation.
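For reference, DataFrame.corrwith pairs up columns with the same name in two tables, aligns rows on the index, and returns one Pearson correlation per column. A small sketch with invented values:

```python
import pandas as pd

# Invented values; both tables share the same player index and column names.
players = ["A", "B", "C", "D", "E"]
prior = pd.DataFrame(
    {"BA": [0.250, 0.270, 0.300, 0.260, 0.280],
     "K%": [20.0, 25.0, 30.0, 18.0, 22.0]},
    index=players,
)
season_2020 = pd.DataFrame(
    {"BA": [0.262, 0.241, 0.285, 0.270, 0.255],
     # Every K% shifted by the same amount, so the correlation is exactly 1.
     "K%": [22.0, 27.0, 32.0, 20.0, 24.0]},
    index=players,
)

# One correlation coefficient per statistic.
correlations = prior.corrwith(season_2020)
print(correlations)
```

Note that a uniform shift, like the K% column here, leaves the correlation at 1 even though every value changed. Correlation measures how well the ordering and spacing of players carries over, not whether the league-wide level moved.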
The first notable observation is that every one of these statistics exhibited a positive correlation. The statistic that seemed to translate best from season to season is K%. This makes sense, as most other batting statistics depend on the various possible outcomes of a batted ball, while K% reflects a much more binary outcome: "hit the ball" or "don't hit the ball". On the other hand, BA was the least consistent statistic from 2018/2019 to 2020.
Statistical Analysis: Paired T-Tests
Before carrying out post-hoc analysis on the data, I used the Shapiro-Wilk test to check that each statistic in question could be assumed to be normally distributed. As it turns out, BB% and xOBP cannot be assumed to be normally distributed, so I chose not to continue with these statistics.
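A sketch of a Shapiro-Wilk screen using scipy.stats.shapiro. The data is randomly generated (one roughly normal column and one deliberately skewed column with placeholder names), and running the test column by column is an assumption about how the check was applied.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data only: "stat_a" is drawn from a normal distribution,
# "stat_b" from a strongly skewed (exponential) one.
table = pd.DataFrame({
    "stat_a": rng.normal(loc=0.250, scale=0.030, size=200),
    "stat_b": rng.exponential(scale=8.0, size=200),
})

alpha = 0.05
shapiro_p = {}
for column in table.columns:
    # Null hypothesis: the sample comes from a normal distribution.
    _, p_value = stats.shapiro(table[column])
    shapiro_p[column] = p_value

# Columns where normality is rejected at the chosen alpha.
non_normal = [column for column, p in shapiro_p.items() if p < alpha]
print(shapiro_p)
print(non_normal)
```

A small p-value rejects normality, so columns landing in non_normal would be the ones set aside before a t-test.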
I chose to run paired t-tests on the remaining statistics, as this test checks for a mean difference between two paired samples. The statistics and their respective p-values are below. Those in green indicate significant differences; all others are in red (alpha value of 0.05).
According to the t-test, the averages of four statistics changed significantly between 2018/2019 and 2020: K%, BA, SLG, and xBA.
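A paired t-test of this kind can be sketched with scipy.stats.ttest_rel, which compares two measurements taken on the same players. The values below are invented so that one statistic shifts consistently and the other does not.

```python
import pandas as pd
from scipy import stats

# Invented values; row i in both tables refers to the same player.
players = ["A", "B", "C", "D", "E", "F"]
prior = pd.DataFrame(
    {"BA":  [0.250, 0.270, 0.300, 0.260, 0.280, 0.240],
     "OPS": [0.750, 0.800, 0.900, 0.780, 0.820, 0.700]},
    index=players,
)
season_2020 = pd.DataFrame(
    {"BA":  [0.232, 0.249, 0.281, 0.238, 0.262, 0.219],   # everyone drops ~20 points
     "OPS": [0.770, 0.780, 0.910, 0.770, 0.840, 0.680]},  # ups and downs cancel out
    index=players,
)

alpha = 0.05
rows = []
for column in prior.columns:
    # Tests whether the mean of the per-player differences is zero.
    t_stat, p_value = stats.ttest_rel(prior[column], season_2020[column])
    rows.append({"stat": column, "t": t_stat, "p_value": p_value,
                 "significant": p_value < alpha})

results = pd.DataFrame(rows).set_index("stat")
print(results)
```

Because the test works on per-player differences, a small but consistent shift (the BA column here) can be highly significant, while larger changes that cancel out across players (the OPS column) are not.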
To illustrate just how small a change in means can still be significant, I'll leave a table below that shows the difference in means of the four significant statistics from 2018/2019 to 2020, expressed as a percentage.
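One plausible way to compute such a percentage is the change in the mean relative to the 2018/2019 mean. That formula is an assumption (the post does not spell it out), and the numbers below are invented.

```python
import pandas as pd

# Invented values for two players.
prior = pd.DataFrame({"BA": [0.240, 0.260], "K%": [20.0, 24.0]})
season_2020 = pd.DataFrame({"BA": [0.235, 0.255], "K%": [22.0, 26.4]})

# Percent change of each column mean, relative to the 2018/2019 mean.
pct_change_in_means = (season_2020.mean() - prior.mean()) / prior.mean() * 100
print(pct_change_in_means.round(2))
```

Here the mean BA falls from .250 to .245, a 2% decrease, while the mean K% rises from 22.0 to 24.2, a 10% increase.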
Thoughts and Conclusions
The shortening of the MLB season in 2020 gave analysts the chance to uncover significant discrepancies in players' year-to-year performance. A long-examined phenomenon in baseball is hitters being "streaky", or simply better or worse than their typical averages, but only for short lengths of time. Thus, playing a season roughly a third of the length of a typical one gave players the chance to overperform or underperform their previous seasons, without enough games to allow for regression to the mean.
Why It Mattered
Firstly, hitting a ball thrown by a major league pitcher is, in a word, hard. With the early cancellation of Spring Training, and less time to prepare, that act became even harder. The strikeout rate in 2020 was the second highest ever recorded in a major league season, behind only 2019, and the batting average was MLB's 12th lowest on record, with the 11 lower seasons all preceding 1972. Simply put, hitters struggled.
Secondly, BABIP, or Batting Average on Balls In Play (a measure of batting average that counts only the plate appearances in which a player puts the ball in play), was at a near all-time low this year. This indicates, in theory, a combination of improved fielding by position players on defense and a decrease in quality of contact at the plate. Digging deeper into it, improved fielding would correlate directly with a decrease in xBA, while a decrease in quality of contact would lead to a decrease in SLG. Interestingly, these are two of the four significant statistics.
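For readers unfamiliar with the statistic, BABIP is conventionally computed as (H - HR) / (AB - K - HR + SF). A small sketch of that standard definition, with a made-up stat line and a hypothetical function name:

```python
def babip(hits: int, home_runs: int, at_bats: int,
          strikeouts: int, sac_flies: int) -> float:
    """Batting average on balls in play, using the conventional formula.

    Home runs are removed from both hits and at-bats because they are not
    fieldable, strikeouts are removed because no ball was put in play, and
    sacrifice flies are added back because they were balls in play.
    """
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / balls_in_play


# Made-up stat line: 150 H, 30 HR, 550 AB, 120 K, 5 SF.
example = babip(hits=150, home_runs=30, at_bats=550, strikeouts=120, sac_flies=5)
print(round(example, 3))  # 120 / 405
```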
Thirdly, with fewer games, teams faced each other fewer times. This led to fewer occasions on which a batter faced a pitcher he had already seen during the season. The percentage of at-bats in the 2020 season where a batter had seen a pitcher three times was down from 13% in 2019 to below 8.5%.
Why It Didn't Matter
The statistics that failed to return significant results have a few things in common. Firstly, they all depend on more factors than simply getting hits. ISO measures SLG independent of Batting Average, and xISO adds a consideration for a batted ball's exit velocity. xSLG and OPS are directly tied to SLG, and wOBA assigns a value to each method of reaching base, to properly weight the importance of a plate appearance.
Secondly, the number of HRs hit per team per game in 2020 (1.28) was only slightly below the record high of 2019 (1.39), so while the statistics grounded more in power than pure hitting may have been affected, it wasn't enough to lead to significant change.
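The two simplest of these relationships follow from the standard definitions ISO = SLG - BA and OPS = OBP + SLG. A short sketch with an invented stat line (the function names are only for illustration):

```python
def isolated_power(slg: float, ba: float) -> float:
    """ISO: slugging with the singles-level contribution of BA removed."""
    return slg - ba


def on_base_plus_slugging(obp: float, slg: float) -> float:
    """OPS: on-base percentage plus slugging percentage."""
    return obp + slg


# Invented stat line: .280 BA, .350 OBP, .500 SLG.
iso = isolated_power(slg=0.500, ba=0.280)
ops = on_base_plus_slugging(obp=0.350, slg=0.500)
print(round(iso, 3), round(ops, 3))
```

Since both are built from SLG plus another component, a shift in SLG can be partly offset (or amplified) by movement in BA or OBP.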
Actual vs. Expected
Curiously, while SLG showed a statistically significant change, xSLG did not. A possible explanation is that xSLG is based not only on the bases a batted ball actually produced, but on the expected number of bases given the ball's exit velocity and launch angle, and the batter's sprint speed. These additional metrics allow for a more robust formula that's less likely to change as much over the course of three seasons.
If you are interested in any of the raw data tables and code used in this post, feel free to visit my GitHub repository on this topic.