Exploratory Data Analysis

Football matches are notoriously hard to predict because of the random, low-scoring nature of the game. One prevailing nugget of wisdom, however, is that home teams win more often on average, and this has been statistically backed up.

The plot below shows the outcome of all games: home wins occur 46% of the time. This means a model that simply predicts a home win every time should be right approximately 46% of the time.

In recent years, more advanced metrics have been proposed to account for the randomness of results. An example is xG (Expected Goals), the probability that a shot will lead to a goal based on factors such as the events leading up to the shot, the location of the shooter, the body part used to shoot, the type of pass that led to the shot, and the type of attacking move that created it. Each shot is compared with historical data on similar shots and how often they were converted.

Even with all these metrics, it is still quite hard to predict game outcomes, which is why the betting industry is worth billions of dollars. Betting companies use advanced models to generate odds, though they adjust them slightly to make sure the house always wins.

The idea behind this project is to use the closing odds from various gambling companies and buttress them with team form (rolling averages of various performance metrics over the previous n games), transfer spending, and FIFA ratings.

The data I used for this project was downloaded from football-data.co.uk, fivethirtyeight.com, fbref.com, and fifaindex.com. A lot of the work involved scraping the data from these sources and putting it together into a structured dataset.

For this project I decided to use R's data.table, because I love how blazing fast it is and its one-liner approach to data manipulation. It is quite unnecessary for the size of data involved in this project, but I plan to include other leagues in the future, and I can see the data getting big (big data?) very quickly.

The two custom functions below are used to create the confusion matrix plot and to compute rolling means of the match stats.
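A rough sketch of what these helpers might look like is shown below, using ggplot2 for the heat map and data.table's frollmean for the rolling means; the function names, the 5-game window default, and the one-game lag are my own choices rather than anything prescribed by the data.

```r
library(data.table)
library(ggplot2)

# Plot a confusion matrix as a heat map of actual vs. predicted outcomes
plot_confusion <- function(actual, predicted) {
  cm <- as.data.table(table(Actual = actual, Predicted = predicted))
  ggplot(cm, aes(x = Actual, y = Predicted, fill = N)) +
    geom_tile() +
    geom_text(aes(label = N), colour = "white") +
    labs(title = "Confusion matrix", fill = "Count")
}

# Rolling mean of a stat over the previous n games, lagged by one game so a
# match's own stats never leak into its "form" value
roll_form <- function(x, n = 5) {
  shift(frollmean(x, n, align = "right"), 1L)
}
```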

Variable Analysis

After the data has been reshaped to calculate team form using rolling averages, variable analysis needs to be done.
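To illustrate the reshaping step, the snippet below applies the rolling-mean helper by team; the table name team_games and the stat columns goals and shots are hypothetical stand-ins for the actual columns.

```r
# One row per team per match, ordered chronologically within each team
setorder(team_games, team, date)

# Rolling form over each team's previous 5 games
team_games[, `:=`(
  goals_form = roll_form(goals, 5),
  shots_form = roll_form(shots, 5)
), by = team]
```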

The correlation plot below shows the intercorrelations between predictors. This is the first step in variable analysis, and from the plot it makes sense that half-time goals and full-time goals are strongly correlated. BbAvH (the average betting odds for the home team) also has moderate correlations with the full-time goals rolling average and the total shots rolling average. This is intuitive because stronger teams tend to take more shots, which increases the chance of one or two going in.
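The correlation matrix itself can be produced along these lines; corrplot is simply my choice of plotting package here, and matches stands in for the assembled dataset.

```r
library(corrplot)

# Pairwise correlations between the numeric predictors
num_cols <- names(matches)[sapply(matches, is.numeric)]
corr_mat <- cor(matches[, ..num_cols], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.7)
```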

It is also important to understand the distributions of the variables. It is not necessary to normalize the distributions before PCA, but it is important to standardize the data because the variables are in different units and their values differ by orders of magnitude.

Principal Component Analysis

The scree plot displays how much of the variability in the data is captured by each dimension. Reading a scree plot is somewhat subjective, but visual inspection suggests we can use the first four dimensions for prediction, even though they only capture about 60% of the variance in the dataset.
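A sketch of the PCA step is shown below. Centring and scaling are handled inside prcomp, and factoextra's fviz_eig draws the scree plot; the assumption is that rows with incomplete rolling averages have already been dropped.

```r
library(factoextra)

# PCA on the standardized predictors (centring and scaling deal with the
# different units and orders of magnitude noted above)
pca_res <- prcomp(matches[, ..num_cols], center = TRUE, scale. = TRUE)

# Scree plot: variance explained by each principal component
fviz_eig(pca_res, addlabels = TRUE)

# Cumulative variance captured by the first four dimensions
summary(pca_res)$importance["Cumulative Proportion", 1:4]
```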

The plot below shows the contribution of each variable to the first two dimensions. The betting odds and FIFA ratings are the biggest contributors.

Variable Selection

The PCA contribution plot is useful for choosing the variables to use as predictors. The expected contribution of a variable is 3.125% (100% / 32), since there are 32 variables in the dataset. Across four dimensions, any variable that contributes more than 12.5% (4 × 3.125%) can be regarded as important. This 12.5% cutoff has been added to the plots to show which variables will be included in the prediction models.
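The cutoff and the contribution plot can be generated as below; fviz_contrib already draws an expected-contribution reference line of its own, so the explicit geom_hline is just there to make the 12.5% threshold visible.

```r
# Expected contribution if all 32 variables contributed equally
exp_contrib <- 100 / 32        # 3.125% per dimension
cutoff      <- 4 * exp_contrib # 12.5% across the first four dimensions

# Variable contributions to dimensions 1-4, with the cutoff drawn in
fviz_contrib(pca_res, choice = "var", axes = 1:4) +
  geom_hline(yintercept = cutoff, linetype = "dashed", colour = "red")
```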

Prediction Models

CART Tree with k-Fold Cross-Validation

Classification trees have become very popular in recent years; they are versatile and their results are easy to understand.

The confusion matrix shows an in-sample accuracy of 55%, but the model could still perform terribly out of sample. The model could be highly biased, so we have to split the data to see how it performs on new data.
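For reference, the in-sample fit behind that confusion matrix could look roughly like this; model_dat is assumed to hold the selected predictors plus the full-time result column FTR (the football-data.co.uk naming), coded as a factor with levels H, D and A.

```r
library(rpart)

model_dat[, FTR := factor(FTR)]  # outcome: H / D / A

# Classification tree fitted on the whole dataset (in-sample)
tree_fit <- rpart(FTR ~ ., data = model_dat, method = "class")

# In-sample confusion matrix and accuracy
pred_in <- predict(tree_fit, type = "class")
table(Actual = model_dat$FTR, Predicted = pred_in)
mean(pred_in == model_dat$FTR)
```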

Before that, we also want to perform cross-validation to find the complexity parameter that produces the most accurate out-of-sample tree. From the plot, a cp (complexity parameter) value of 0.008 gives the highest cross-validated accuracy, of about 0.53. Using this value, we can generate the tree and predict on test data held out from the dataset.
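One way to carry out the split and the k-fold search over cp is via caret, as sketched below; the 80/20 split and the grid of cp values are my assumptions, not the exact settings used.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(model_dat$FTR, p = 0.8, list = FALSE)[, 1]
train_dat <- model_dat[train_idx]
test_dat  <- model_dat[-train_idx]

# 10-fold cross-validation over a grid of complexity parameters
cv_fit <- train(
  FTR ~ ., data = train_dat, method = "rpart",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = data.frame(cp = seq(0.001, 0.02, by = 0.001))
)
cv_fit$bestTune  # cp with the highest cross-validated accuracy

# Out-of-sample performance of the chosen tree
pred_out <- predict(cv_fit, newdata = test_dat)
mean(pred_out == test_dat$FTR)
```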

Surprisingly, the model performs just as well out of sample. A 55% prediction accuracy is very good: a betting model that wins 55% of the time should provide a high ROI if a user decides to put money on all matches as a form of long-term investment.

Random Forest

The logical extension of the tree model is the random forest, which is essentially an ensemble of many trees. The downside of the random forest approach is that it can be a little hard to interpret; on the flip side, it typically performs better than single trees.
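A random forest fit on the same train/test split might look like the following; 500 trees is just the randomForest default made explicit.

```r
library(randomForest)

set.seed(42)
rf_fit <- randomForest(FTR ~ ., data = train_dat, ntree = 500, importance = TRUE)

# Out-of-sample accuracy and variable importance
rf_pred <- predict(rf_fit, newdata = test_dat)
mean(rf_pred == test_dat$FTR)
varImpPlot(rf_fit)
```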

The random forest is surprisingly not as accurate as the tree. It only predicts the right result 54% of the time.

Naive Bayes Classifier

The Naive Bayes classifier assumes there is no intercorrelation between the predictors. I expect this to be the worst-performing of all the models, because we already know the underlying assumption to be wrong.
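The fit itself is a one-liner with e1071, along these lines:

```r
library(e1071)

# Naive Bayes treats the predictors as conditionally independent given the outcome
nb_fit  <- naiveBayes(FTR ~ ., data = train_dat)
nb_pred <- predict(nb_fit, newdata = test_dat)
mean(nb_pred == test_dat$FTR)
```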

Multinomial Logistic Regression

Multinomial logistic regression is a generalization of logistic regression to multiclass problems. This seems appropriate because it does not require the predictors to be independent of each other. It is also an attractive model to use because it does not assume normality.
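The model can be fitted with multinom() from the nnet package; the call below is a sketch of the step that produces the iteration log shown next.

```r
library(nnet)

# Multinomial logistic regression over the three outcomes (H / D / A)
mn_fit  <- multinom(FTR ~ ., data = train_dat)
mn_pred <- predict(mn_fit, newdata = test_dat)
mean(mn_pred == test_dat$FTR)
```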

```
## # weights:  57 (36 variable)
## initial  value 4572.424345 
## iter  10 value 4159.260180
## iter  20 value 4092.925695
## iter  30 value 4002.801363
## iter  40 value 3982.732878
## final  value 3982.729391 
## converged
```

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is used next. The LDA model is well suited to multiclass problems, although it makes a few assumptions: it assumes the predictors are normally distributed and uses a linear combination of the predictors to predict the outcome. From the histogram plot above, we can see that a few of the variables have skewed distributions, so the log of the data is taken to get them as close to normal as possible.

Now that we have the distributions we wanted, we can perform LDA and generate the confusion matrix.
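Below is a sketch of the LDA step with the log transform applied to the skewed columns first; the column names in skewed are hypothetical, and log1p is used rather than log so that zero counts do not produce -Inf.

```r
library(MASS)

# Log-transform the skewed predictors on both splits
skewed    <- c("goals_form", "shots_form")  # hypothetical column names
train_lda <- copy(train_dat)[, (skewed) := lapply(.SD, log1p), .SDcols = skewed]
test_lda  <- copy(test_dat)[,  (skewed) := lapply(.SD, log1p), .SDcols = skewed]

lda_fit  <- lda(FTR ~ ., data = train_lda)
lda_pred <- predict(lda_fit, newdata = test_lda)$class

# Confusion matrix and out-of-sample accuracy
table(Actual = test_lda$FTR, Predicted = lda_pred)
mean(lda_pred == test_lda$FTR)
```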

Quadratic Discriminant Analysis

The Quadratic Discriminant Analysis (QDA) model is similar to Linear Discriminant Analysis, except that it does not assume a common covariance matrix across the outcome classes.
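The QDA fit mirrors the LDA one; only the covariance assumption changes.

```r
# Same setup as LDA, but each class gets its own covariance matrix
qda_fit  <- qda(FTR ~ ., data = train_lda)
qda_pred <- predict(qda_fit, newdata = test_lda)$class
table(Actual = test_lda$FTR, Predicted = qda_pred)
mean(qda_pred == test_lda$FTR)
```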

Conclusion

After trying various models for prediction, I couldn't do better than a 55% out-of-sample prediction accuracy. The betting odds dominate the models, as expected, because the complex algorithms used by the multi-billion-dollar gambling industry already account for team performance metrics. In the future, I plan to incorporate other variables into my models, such as injuries, xG, league standing before the match, managerial changes, fantasy premier league performance, and even the weather. The greatest obstacle is data availability and collection.