Football matches are notoriously hard to predict because the games are random and low-scoring. One prevailing nugget of wisdom, however, is that home teams win more often on average, and this is statistically well supported.
The plot below shows the outcome of all games: home wins occur 46% of the time. This means a model that always predicts a home win should be right approximately 46% of the time.
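That naive baseline is worth writing down explicitly, since every model later in the post has to beat it. A minimal sketch (the `FTR` vector here is a toy example; the real dataset's full-time result column works the same way):

```r
# Baseline accuracy of "always predict a home win".
# FTR is the full-time result: H = home win, D = draw, A = away win.
FTR <- c("H", "A", "H", "D", "H", "A", "H", "D", "H", "A")
baseline_acc <- mean(FTR == "H")
baseline_acc  # 0.5 on this toy sample; about 0.46 on the real data
```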
In recent years, more advanced metrics have been proposed to account for the randomness of results. One example is xG (expected goals): the probability that a shot leads to a goal, based on factors such as the events leading up to the shot, the location of the shooter, the body part used to shoot, the type of pass that set up the shot, and the type of attacking move that created it. Each shot is compared with historical data on similar shots and how often they were converted.
Even with all these metrics, it is still quite hard to predict game outcomes, which is why the betting industry is worth billions of dollars. Betting companies have advanced models for generating odds, although they adjust them slightly to make sure the house always wins.
The idea behind this project is to use the closing odds from various gambling companies and supplement them with team form (rolling averages of the previous n games for various performance metrics), transfer spending, and FIFA ratings.
The data I used for this project was downloaded from football-data.co.uk, fivethirtyeight.com, fbref.com, and fifaindex.com. Much of the work involved getting and scraping the data, then combining it into a structured dataset.
For this project I decided to use R's data.table, because I love how blazing fast it is and its one-liner approach to data manipulation. It is quite unnecessary for the amount of data involved here, but I plan to include other leagues in the future, and I can see the data getting big (big data?) very quickly.
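To illustrate the one-liner style, here is a small self-contained example in the spirit of this project (the column names `Season` and `FTHG` are illustrative; `FTHG` is the full-time home goals column used by football-data.co.uk):

```r
library(data.table)

# Aggregate in a single data.table expression:
# average full-time home goals per season, grouped by Season.
dt <- data.table(Season = c("2018/19", "2018/19", "2019/20"),
                 FTHG   = c(2, 1, 3))
dt[, .(AvgHomeGoals = mean(FTHG)), by = Season]
```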
The two custom functions below create the confusion matrix plot and compute rolling means of the match stats.
#All functions
#Takes a confusion matrix object and plots it with the important metrics
plot_confusion_matrix = function(cm){
  autoplot(cm, type = "heatmap") +
    scale_fill_gradient(low = "#D6EAF8", high = "#2E86C1") +
    ggtitle(paste0("Accuracy = ", format(round(summary(cm)[[".estimate"]][1], 2), nsmall = 2),
                   " Sensitivity = ", format(round(summary(cm)[[".estimate"]][3], 2), nsmall = 2),
                   " Specificity = ", format(round(summary(cm)[[".estimate"]][4], 2), nsmall = 2))) +
    theme(plot.title = element_text(hjust = 0.5))
}

##Custom function - Find the rolling mean of the previous n elements
shift_froll = function(x, n){shift(frollmean(x, n = n))}

After the data has been reshaped to calculate team form using rolling averages, variable analysis needs to be done.
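To see why `shift_froll` wraps `frollmean` in `shift`, here is a self-contained example: the shift lags the rolling mean by one game, so each match only "sees" form from earlier matches and never leaks its own stats into its predictors. The `Team` and `Shots` columns are made up for this sketch:

```r
library(data.table)

shift_froll = function(x, n){shift(frollmean(x, n = n))}

# 3-game rolling average of shots per team, lagged by one match.
matches <- data.table(Team  = rep("Arsenal", 5),
                      Shots = c(10, 14, 12, 8, 16))
matches[, STrAvg := shift_froll(Shots, 3), by = Team]
matches$STrAvg  # NA NA NA 12.00 11.33
```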
The correlation plot below shows the intercorrelations between predictors. This is the first step in variable analysis, and from the plot it makes sense that half-time goals and full-time goals are strongly correlated. BbAvH (the average betting odds for a home win) also correlates moderately with the rolling averages of full-time goals and total shots. This is intuitive because stronger teams tend to take more shots, which increases the chance of one or two going in.
It is also important to understand the distributions of the variables. Although it is not necessary to normalize the distributions before PCA, it is important to standardize the data, because the variables are measured in different units and their values differ by orders of magnitude.
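The original standardize-and-PCA call is not shown in the post; a sketch of one common way to do it, using `prcomp` with scaling (the `numeric_vars` selection is illustrative):

```r
# Standardize and run PCA in one step: prcomp() centres the data and,
# with scale. = TRUE, divides each variable by its standard deviation,
# so no large-magnitude variable dominates the components.
numeric_vars <- Filter(is.numeric, ModelData)
pcaEPL <- prcomp(numeric_vars, center = TRUE, scale. = TRUE)
summary(pcaEPL)  # proportion of variance per principal component
```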
The scree plot displays how much of the data's variability is captured by each dimension. Reading a scree plot is somewhat subjective, but visual inspection suggests we can use the first 4 dimensions for prediction, even though they capture only about 60% of the dataset's variance.
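A scree plot like the one described can be produced with factoextra, assuming the `pcaEPL` object used in the plots below:

```r
library(factoextra)

# Scree plot: percentage of variance explained by each dimension,
# with the percentages printed on the bars.
fviz_eig(pcaEPL, addlabels = TRUE)
```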
The plot below shows the contribution of each variable to the first 2 dimensions. The betting odds and fifa ratings are the biggest contributors.
fviz_pca_var(pcaEPL,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)

The PCA contribution plot is useful for choosing the variables to use as predictors. With 32 variables in the dataset, the expected contribution of any one variable is 3.125% per dimension. Across 4 dimensions, any variable that contributes more than 12.5% (4 x 3.125%) can therefore be regarded as important. This 12.5% cutoff has been added to the plots to show which variables will be included in the prediction models.
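The cutoff logic can be sketched in a few lines. Note this uses a simple unweighted sum of contributions across the first 4 dimensions, whereas factoextra's `fviz_contrib()` weights dimensions by their eigenvalues; the variable selection shown here is an illustration, not the post's exact procedure:

```r
library(factoextra)

# Expected contribution under equal importance:
# 100% / 32 variables = 3.125% per dimension, i.e. 12.5% over 4 dimensions.
cutoff <- 4 * 100 / 32  # 12.5

# Contribution (%) of each variable to each dimension of pcaEPL.
contrib <- get_pca_var(pcaEPL)$contrib
important <- rowSums(contrib[, 1:4]) > cutoff
names(important)[important]  # variables to keep as predictors
```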
Classification trees have become very popular in recent years; they are versatile and their results are easy to understand.
#Classification Tree
EPLTree <- rpart(FTR ~ AFifaATT + AFifaMID + AFifaDEF + AFifaOVR + ASpend +
                   BbAvA + BbAvD + BbAvH + BbMxA + BbMxD + BbMxH +
                   HFifaDEF + HFifaMID + HFifaOVR + HSpend + STrAvg + STrAvgAway,
                 data = ModelData, method = "class")
fancyRpartPlot(EPLTree)

The confusion matrix shows an in-sample accuracy of 55%, but the model could still perform terribly out of sample. It may be overfit to the training data, so we have to split the data to see how the model performs on new data.
Before that, we also want to perform cross-validation to find the complexity parameter that produces the out-of-sample tree with the highest accuracy. From the plot, a cp (complexity parameter) value of 0.008 gives the highest cross-validated accuracy of about 0.53. Using this value, we can generate the tree and predict on test data held out from the dataset.
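A sketch of the cp selection and pruning step, assuming the `EPLTree` fit from above (`TrainData`/`TestData` are illustrative names for the split):

```r
# rpart cross-validates internally (10-fold by default); printcp and
# plotcp show the cross-validated error for each candidate cp value.
printcp(EPLTree)
plotcp(EPLTree)

# Prune at the chosen cp and evaluate on a held-out test set.
prunedTree <- prune(EPLTree, cp = 0.008)
preds <- predict(prunedTree, newdata = TestData, type = "class")
mean(preds == TestData$FTR)  # out-of-sample accuracy
```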
Surprisingly, the model performs just as well out of sample. A 55% prediction accuracy is very good: a betting model that wins 55% of the time should provide a high ROI if a user decides to put money on all matches as a form of long-term investment.
The logical extension of the tree model is the random forest, which is essentially an ensemble of many trees. The downside of the random forest approach is that it can be hard to interpret; on the flip side, it typically performs better than a single tree.
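A sketch of the random forest fit, mirroring the tree formula above (the seed is illustrative and `ntree` is left at its default):

```r
library(randomForest)

# FTR must be a factor for randomForest to do classification.
set.seed(42)
EPLForest <- randomForest(FTR ~ AFifaATT + AFifaMID + AFifaDEF + AFifaOVR +
                            ASpend + BbAvA + BbAvD + BbAvH + BbMxA + BbMxD +
                            BbMxH + HFifaDEF + HFifaMID + HFifaOVR + HSpend +
                            STrAvg + STrAvgAway,
                          data = ModelData)
varImpPlot(EPLForest)  # which predictors the forest leans on most
```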
The random forest is surprisingly not as accurate as the tree. It only predicts the right result 54% of the time.
The Naive Bayes classifier assumes no intercorrelation between the predictors. I expect this to be the worst-performing of all the models, because we already know from the correlation plot that this assumption is violated.
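A minimal Naive Bayes sketch with e1071 (the predictor subset here is illustrative, not the full set used in the other models):

```r
library(e1071)

# Each predictor is treated as conditionally independent given FTR,
# which is exactly the assumption the correlation plot contradicts.
EPLNB <- naiveBayes(FTR ~ BbAvA + BbAvD + BbAvH + HFifaOVR + AFifaOVR,
                    data = ModelData)
nbPred <- predict(EPLNB, newdata = ModelData)
mean(nbPred == ModelData$FTR)  # in-sample accuracy
```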
Multinomial logistic regression is a generalization of logistic regression to multiclass problems. It seems appropriate here because it does not require the predictors to be independent of each other. It is also an attractive model because it does not assume normality.
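The iteration log below looks like the output of `nnet::multinom`; a sketch of such a fit (the formula is illustrative, as the original call is not shown):

```r
library(nnet)

# multinom prints an iteration log like the one below while it
# optimizes the multinomial log-likelihood.
EPLMultinom <- multinom(FTR ~ BbAvA + BbAvD + BbAvH + HFifaOVR + AFifaOVR +
                          HSpend + ASpend, data = ModelData)
```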
## # weights: 57 (36 variable)
## initial value 4572.424345
## iter 10 value 4159.260180
## iter 20 value 4092.925695
## iter 30 value 4002.801363
## iter 40 value 3982.732878
## final value 3982.729391
## converged
Linear Discriminant Analysis (LDA) is used to predict next. The LDA model is well suited to multiclass problems, although it makes a few assumptions: the data is normally distributed, and a linear combination of predictors determines the outcome. From the histogram plot above, we can see that a few of the variables have skewed distributions, so the log of the data is taken to bring them as close to normal as possible.
Now that we have the desired distributions, we can perform LDA and generate the truth table.
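A sketch of the log transform and LDA fit with MASS (the transformed columns and predictor subset are illustrative; `log1p` is used here to avoid problems with zero values such as zero transfer spending):

```r
library(MASS)

# Log-transform the skewed spending variables, then fit LDA.
LogData <- ModelData
LogData$HSpend <- log1p(LogData$HSpend)
LogData$ASpend <- log1p(LogData$ASpend)

EPLLDA <- lda(FTR ~ BbAvA + BbAvD + BbAvH + HSpend + ASpend, data = LogData)
ldaPred <- predict(EPLLDA)$class
table(Predicted = ldaPred, Actual = LogData$FTR)  # truth table
```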
The Quadratic Discriminant Analysis (QDA) model is similar to Linear Discriminant Analysis, except that it does not assume a common covariance matrix across the classes.
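QDA has the same interface as LDA in MASS; a sketch with an illustrative predictor subset:

```r
library(MASS)

# qda() estimates a separate covariance matrix for each class,
# giving quadratic rather than linear decision boundaries.
EPLQDA <- qda(FTR ~ BbAvA + BbAvD + BbAvH + HSpend + ASpend, data = ModelData)
qdaPred <- predict(EPLQDA)$class
mean(qdaPred == ModelData$FTR)  # in-sample accuracy
```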
After trying various models for prediction, I couldn't do better than a 55% out-of-sample prediction accuracy. The betting odds dominate the models, as expected, because the complex algorithms used by the multi-billion-dollar gambling industry already account for team performance metrics. In the future, I plan to incorporate other variables into my models, such as injuries, xG, league standing before the match, managerial changes, fantasy premier league performance, and even the weather. The greatest obstacle is data availability and collection.