I’ve always been curious to know if any of the 4 major European leagues
(Serie A, Bundesliga, Premiership, La Liga) are more predictable than
others. La Liga certainly has a reputation as being dull and
predictable, although this is due to the sheer dominance of Barcelona
and Real Madrid in recent years. I’ve increased my database of football
matches in order to improve my football prediction
bot this summer, and so now
have sufficient data to investigate. In this analysis my measure of
*predictability* is whether the favourite according to the bookies won
the match, so in essence I’m measuring the number of **upsets**. I have
access to a data set comprising full match results from the last 11
seasons (^{2004}⁄_{2005} through to ^{2015}⁄_{2016}) for the four major European
leagues, along with the prior odds assigned to each of the three
possible match outcomes. Note that the particular bookies I’m using is
William Hill, and so in theory there could be an element of bias here as
this is an English bookmakers, who presumably would have additional
expertise in the English football leagues since this will be by far
their biggest market. In the modern globalised world this could be
completely wrong as they could contract out markets in different
countries to local bookmakers, or form their odds by analysing the
market - I have no idea, but just worth bearing in mind.

## Initial look

I’ll be running my analysis in R, so I need to load some libraries as well as load my data (not shown).

```
library(dplyr)
library(ggplot2)
library(ggthemes)
```

Let’s have a quick look at the raw data (already collated in a variable
called `res`

) to check it looks sensible first:

```
knitr::kable(head(res))
```

id | league | season | homeodds | drawodds | awayodds | outcome |
---|---|---|---|---|---|---|

1 | premiership | 2012-2013 | 0.5780347 | 0.2777778 | 0.2083333 | h |

2 | premiership | 2012-2013 | 0.8000000 | 0.1818182 | 0.0833333 | h |

3 | premiership | 2012-2013 | 0.7518797 | 0.2000000 | 0.1111111 | a |

4 | premiership | 2012-2013 | 0.1818182 | 0.2309469 | 0.6535948 | a |

5 | premiership | 2012-2013 | 0.5555556 | 0.3076923 | 0.2000000 | d |

6 | premiership | 2012-2013 | 0.5555556 | 0.2777778 | 0.2309469 | a |

```
summary(res)
```

```
## id league season homeodds
## Min. : 1 bundesliga1:3366 2005-2006:1446 Min. :0.03846
## 1st Qu.: 3882 laliga :3800 2006-2007:1446 1st Qu.:0.37037
## Median : 8144 premiership:4180 2008-2009:1446 Median :0.47619
## Mean : 8198 seriea :4180 2009-2010:1446 Mean :0.48377
## 3rd Qu.:12405 2010-2011:1446 3rd Qu.:0.59880
## Max. :16286 2011-2012:1446 Max. :0.98039
## (Other) :6850
## drawodds awayodds outcome
## Min. :0.05882 Min. :0.01961 a:4358
## 1st Qu.:0.27778 1st Qu.:0.20000 d:3955
## Median :0.30303 Median :0.29412 h:7213
## Mean :0.28945 Mean :0.31188
## 3rd Qu.:0.32258 3rd Qu.:0.38462
## Max. :0.98039 Max. :0.92593
##
```

I now need to make a new column indicating whether the
bookies got the prediction right or not, stored as an integer. This will
be my dependent variable for much of the subsequent analysis, with the
`league`

factor being my principal independent variable of interest. I’m
also curious to see how this predictability varies with time, and so
I’ve included the season for each match under the `season`

predictor.

```
res_clean <- res %>%
group_by(id) %>%
mutate(predicted = c('h', 'd', 'a')[which.max(c(homeodds, drawodds, awayodds))],
correct = as.numeric(predicted==outcome))
```

Perfect, the data is in a format suitable for analysis.

```
knitr::kable(head(res_clean))
```

id | league | season | homeodds | drawodds | awayodds | outcome | predicted | correct |
---|---|---|---|---|---|---|---|---|

1 | premiership | 2012-2013 | 0.5780347 | 0.2777778 | 0.2083333 | h | h | 1 |

2 | premiership | 2012-2013 | 0.8000000 | 0.1818182 | 0.0833333 | h | h | 1 |

3 | premiership | 2012-2013 | 0.7518797 | 0.2000000 | 0.1111111 | a | h | 0 |

4 | premiership | 2012-2013 | 0.1818182 | 0.2309469 | 0.6535948 | a | a | 1 |

5 | premiership | 2012-2013 | 0.5555556 | 0.3076923 | 0.2000000 | d | h | 0 |

6 | premiership | 2012-2013 | 0.5555556 | 0.2777778 | 0.2309469 | a | h | 0 |

## Exploratory Analysis

Let’s do some quick summary statistics then. Firstly we can see the overall % of matches correctly predicted by the bookies:

```
mean(res_clean$correct)
```

```
## [1] 0.5314311
```

This is lower than I’d expected, considering that in the last two
seasons Predictaball has got 48% and
43% I expected the bookies
to be more accurate. It goes to show that football prediction is
extremely challenging even for the professionals, although it is
important to bear in mind that a bookies objective isn’t to maximise
prediction accuracy, but rather **profits** by offering worse value odds
(from a customer’s perspective).

We can break this down year on year, pooling together all the four leagues:

Across Europe there doesn’t appear to be any longitudinal shifts in bookie accuracy, I imagine that the league itself is a source of greater variation.

And what about accuracy by division?

```
res_clean %>%
group_by(league) %>%
summarise(accuracy = mean(correct) * 100) %>%
arrange(-accuracy) %>%
knitr::kable()
```

league | accuracy |
---|---|

laliga | 54.3 |

premiership | 53.9 |

seriea | 53.1 |

bundesliga1 | 51.0 |

Interestingly there doesn’t seem to be much difference between every
league **except** the Bundesliga which is noticeably less accurate.
People often deride La Liga for being too predictable, and while the
main league outcomers tend to be one of 3 teams, the individual matches
themselves look as tough to predict as those from any other league, with
only a very slightly higher number of upsets (occasions where the
favourite didn’t win).

Finally, we can look at accuracy by league and season, to see if there are any interesting patterns evident here which were masked when pooling the leagues together.

There are some interesting trends here, for instance the Premiership and
Bundesliga follow a similar trend, with the Premiership having larger
accuracies. The bookie’s accuracy in La Liga followed a very linear
trend from ^{2005}⁄_{2006} season culminating in the ^{2010}⁄_{2011} season, before
returning to similar levels as the other three leagues. Overall however,
there does not appear to be a significant temporal aspect to the
predictability of the leagues, with the league itself seemingly playing
a more important role in the bookies’ accuracy than any time related
factor.

## Inference

From the earlier work we’re left with the impression of a subtle, but not large, effect of league on number of upsets (measured by whether the bookies’ favourite won), with seemingly little temporal element. Now let’s investigate these factors in more depth.

### Chi-square

I’ll firstly run a $\chi^{2}$ test to see if there’s a statistically significant difference in the number of correct predictions by league, with the null hypothesis being that the proportion of upsets in each league is equal.

```
chisq.test(table(res_clean$correct, res_clean$league))
```

```
##
## Pearson's Chi-squared test
##
## data: table(res_clean$correct, res_clean$league)
## X-squared = 9.4037, df = 3, p-value = 0.02438
```

There does appear to be a statistically significant difference in proportion of upsets across the European leagues, even if it isn’t overly large in magnitude.

We can also investigate for a linear trend in seasons, using
`prop.trend.test`

to account for the fact that the independent variable
is ordinal:

```
season_tabulated <- res_clean %>%
group_by(season) %>%
summarise(num_correct = sum(correct),
num = length(correct),
num_wrong = num - num_correct)
```

```
prop.trend.test(season_tabulated$num_correct, season_tabulated$num)
```

```
##
## Chi-squared Test for Trend in Proportions
##
## data: season_tabulated$num_correct out of season_tabulated$num ,
## using scores: 1 2 3 4 5 6 7 8 9 10 11
## X-squared = 0.48574, df = 1, p-value = 0.4858
```

Here we don’t have sufficient evidence to reject the null hypothesis that there’s a linear trend which supports my initial conclusions from inspection of the time-series.

### Logistic regression modelling

Another way of investigating these two factors is in a logistic regression model, since the outcome is the proportion of favourites which won (the complement of number of upsets) which can be modelled using a binomial distribution.

Firstly I’ll create the null model for comparison later on. Quite a high deviance, indicating the model isn’t fitting too well.

```
mod_null <- glm(as.factor(correct) ~ 1, data=res_clean, family='binomial')
summary(mod_null)
```

```
##
## Call:
## glm(formula = as.factor(correct) ~ 1, family = "binomial", data = res_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.231 -1.231 1.124 1.124 1.124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12589 0.01608 7.828 4.97e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21462 on 15525 degrees of freedom
## Residual deviance: 21462 on 15525 degrees of freedom
## AIC: 21464
##
## Number of Fisher Scoring iterations: 3
```

Adding the `league`

predictor only reduces the deviance by 10 (ignore
the AIC for now), with the coefficients for each league showing that
each league has a non-zero impact upon the predictability. Since the
baseline group here is the Bundesliga, the positive magnitude of the
betas shows that all three of the remaining leagues are more
predictable, with La Liga being the most.

```
mod_league <- glm(as.factor(correct) ~ league, data=res_clean, family='binomial')
summary(mod_league)
```

```
##
## Call:
## glm(formula = as.factor(correct) ~ league, family = "binomial",
## data = res_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.251 -1.231 1.105 1.125 1.161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03803 0.03448 1.103 0.27000
## leaguelaliga 0.13397 0.04743 2.825 0.00473 **
## leaguepremiership 0.11730 0.04638 2.529 0.01144 *
## leagueseriea 0.08749 0.04636 1.887 0.05915 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21462 on 15525 degrees of freedom
## Residual deviance: 21453 on 15522 degrees of freedom
## AIC: 21461
##
## Number of Fisher Scoring iterations: 3
```

A univariate model of the `season`

predictor decreases the deviance
further as expected due to having more dummy predictors (and therefore
fewer degrees of freedom) than `league`

. None of the individual seasons
have statistically significant effects, aside from 2013-2014.

```
mod_season <- glm(as.factor(correct) ~ season, data=res_clean, family='binomial')
summary(mod_season)
```

```
##
## Call:
## glm(formula = as.factor(correct) ~ season, family = "binomial",
## data = res_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.296 -1.214 1.063 1.121 1.166
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.195e-02 5.263e-02 1.367 0.17157
## season2006-2007 -4.429e-02 7.441e-02 -0.595 0.55170
## season2007-2008 7.842e-02 8.089e-02 0.969 0.33230
## season2008-2009 1.111e-01 7.456e-02 1.490 0.13611
## season2009-2010 1.307e-01 7.460e-02 1.752 0.07982 .
## season2010-2011 6.102e-02 7.449e-02 0.819 0.41265
## season2011-2012 -1.385e-02 7.442e-02 -0.186 0.85241
## season2012-2013 6.102e-02 7.449e-02 0.819 0.41265
## season2013-2014 2.036e-01 7.476e-02 2.724 0.00645 **
## season2014-2015 1.385e-02 7.444e-02 0.186 0.85237
## season2015-2016 7.767e-16 7.443e-02 0.000 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21462 on 15525 degrees of freedom
## Residual deviance: 21443 on 15515 degrees of freedom
## AIC: 21465
##
## Number of Fisher Scoring iterations: 3
```

The next step is to add both the league and season into a combined model, as shown below. This model again increases deviance as expected, but not by a large amount. Of the predictors, the dummy variables representing belonging to each league are statistically significant, while only one of the seasons is, reinforcing previous findings.

```
mod_seasonleague <- glm(as.factor(correct) ~ season + league, data=res_clean, family='binomial')
summary(mod_seasonleague)
```

```
##
## Call:
## glm(formula = as.factor(correct) ~ season + league, family = "binomial",
## data = res_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.318 -1.226 1.064 1.130 1.204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.818e-02 6.098e-02 -0.298 0.76556
## season2006-2007 -4.432e-02 7.443e-02 -0.595 0.55157
## season2007-2008 9.555e-02 8.153e-02 1.172 0.24120
## season2008-2009 1.112e-01 7.458e-02 1.491 0.13598
## season2009-2010 1.308e-01 7.462e-02 1.752 0.07972 .
## season2010-2011 6.106e-02 7.451e-02 0.819 0.41251
## season2011-2012 -1.385e-02 7.444e-02 -0.186 0.85236
## season2012-2013 6.106e-02 7.451e-02 0.819 0.41251
## season2013-2014 2.038e-01 7.478e-02 2.725 0.00643 **
## season2014-2015 1.386e-02 7.446e-02 0.186 0.85232
## season2015-2016 -3.274e-16 7.445e-02 0.000 1.00000
## leaguelaliga 1.381e-01 4.782e-02 2.888 0.00388 **
## leaguepremiership 1.174e-01 4.641e-02 2.531 0.01139 *
## leagueseriea 8.760e-02 4.639e-02 1.888 0.05899 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21462 on 15525 degrees of freedom
## Residual deviance: 21433 on 15512 degrees of freedom
## AIC: 21461
##
## Number of Fisher Scoring iterations: 3
```

We can also compare the models by AIC, which again reaffirms the belief that the league itself plays a far greater role in the number of upsets than the season.

```
data.frame(model=c('null', 'league', 'season', 'season+league'), aic=c(mod_null$aic, mod_league$aic, mod_season$aic, mod_seasonleague$aic)) %>%
arrange(aic) %>%
knitr::kable()
```

model | aic |
---|---|

league | 21460.82 |

season+league | 21461.22 |

null | 21464.21 |

season | 21464.91 |

## Overall conclusions

The findings support the hypothesis that La Liga contains the fewest upsets out of the four major European football leagues, although the actual magnitude of the difference isn’t that large. Perhaps more interesting, is why the Bundesliga contains noticeably fewer upsets than the other leagues. An longitudinal effect has also been investigated but can be safely ignored.

When I come to forming predictive models of football leagues outside of the Premiership I’ll want to bear these subtle differences in mind, most likely by using a multi-level model to account for the fundamental hierarchical nature. If I decide to go down the Bayesian route this will be a simple addition, whereby the league can be modelled as a random effect.