I’ve always been curious to know whether any of the four major European leagues (Serie A, Bundesliga, Premiership, La Liga) is more predictable than the others. La Liga certainly has a reputation for being dull and predictable, although this is largely down to the sheer dominance of Barcelona and Real Madrid in recent years. I increased my database of football matches this summer in order to improve my football prediction bot, and so now have sufficient data to investigate.
“Two Cultures”

One aspect of statistical modeling that those with a bit of experience can take for granted, but which may not be immediately obvious to newcomers, is the difference between modeling for explanation and modeling for prediction. When you’re a newbie to modeling you may think that this only affects how you interpret your results and what conclusions you’re aiming to draw, but it has a far bigger impact than that: it influences the way you form your models, the types of learning algorithms you use, and even how you evaluate their performance.
I’ve never fully taught myself R, just dipped in and out when necessary. I’ve primarily used it for standard data analysis and visualisation, although I have been meaning to get to grips with one of the numerous available machine learning packages. Dealing with datasets tends to involve a lot of hacky manipulation until the data is in a useful format for your analysis. Initially I was just trying to use standard library functions, but once I came upon the essential reshape2 package, and saw how easily it converts a dataframe between wide and long formats, I knew I was going to have to use a different approach.
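
To make the wide versus long distinction concrete, here is a minimal sketch using reshape2's `melt()` and `dcast()`. The `matches_wide` dataframe and its column names are made up purely for illustration and are not taken from my actual football database.

```r
# Requires the reshape2 package: install.packages("reshape2")
library(reshape2)

# Hypothetical wide-format data: one row per match,
# with a separate column for each side's goals
matches_wide <- data.frame(
  match_id   = 1:3,
  home_goals = c(2, 0, 1),
  away_goals = c(1, 0, 3)
)

# Wide -> long: one row per (match, side) pair
matches_long <- melt(
  matches_wide,
  id.vars       = "match_id",  # columns that identify each row
  variable.name = "side",      # new column holding the old column names
  value.name    = "goals"      # new column holding the values
)

# Long -> wide: rebuild one column per level of `side`
matches_back <- dcast(matches_long, match_id ~ side, value.var = "goals")
```

The long format has one row per observation, which is the shape grouping and plotting functions tend to expect, while `dcast()` takes you back to one row per match when that is more convenient.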