In the previous entry of what has evidently become a series on modelling binary mixtures with Dirichlet Processes (part 1 discussed using pymc3 and part 2 detailed writing custom Gibbs samplers), I ended by stating that I’d like to look into writing a Gibbs sampler using the stick-breaking formulation of the Dirichlet Process, in contrast to the Chinese Restaurant Process (CRP) version I’d just implemented.
Actually coding this up was rather straightforward and took less time than I expected, but I found the differences and similarities between these two ways of expressing the same mathematical model interesting enough for a post of its own.
Back at the start of the year (which really doesn’t seem like that long a time ago) I was looking at using Dirichlet Processes to cluster binary data using PyMC3. I was unable to get the PyMC3 mixture model API working using the general purpose Gibbs Sampler, but after some tweaking of a custom likelihood function I got something reasonable-looking working using Variational Inference (VI). While this was still useful for exploratory analysis purposes, I’d prefer to use MCMC sampling so that I have more confidence in the groupings (since VI only approximates the posterior) in case I wanted to use these groups to generate further research questions.
While I’ve been quite happy with the performance of my Predictaball football rating system, one thing that’s bothered me since its inception last summer is the reliance on hard-coded parameters.
Similar to many other football rating methods, it’s an adaptation of the Elo system that was designed for Chess matches by Arpad Elo in the 1950s. His aim was to devise an easily implementable system to rate competitors in a 2-person zero-sum game.
I came across a tweet from Piers Morgan this morning in which he suggested that the BBC is favouring women since 43 out of the 53 paper reviewers on The Andrew Marr Show in 2019 were women. Unfortunately I was a day late to this hot-take; fortunately this is because I don’t follow Piers Morgan. However, I knew that there must be more to it than a single PC-baiting statistic, and knowing that I had a ~3 hour train journey coming up this evening, I thought I’d look into it a bit more.
Excuse the clickbait title, but I genuinely couldn’t think of a better way of organising this post.
A second post in 2 days on mixture modelling? No awards for guessing what type of analysis I’ve been preoccupied with recently!
Today’s post provides an ugly hack to fix a bug in the R flexmix package for likelihood-based mixture modelling and provides a cautionary tale about environments. In short, I’ve encountered problems when trying to predict the cluster membership for out-of-sample data using this package, and judging from a couple of posts I found online, I’m not the only one.
I’ve been spending a lot of time over the last week getting Theano working on Windows and playing with Dirichlet Processes for clustering binary data using PyMC3. While there is a great tutorial for mixtures of univariate distributions, there isn’t a lot out there for multivariate mixtures, and Bernoulli mixtures in particular.
This notebook shows an example of fitting such a model using PyMC3 and highlights the importance of careful parameterisation as well as demonstrating that variational inference can prove advantageous over standard sampling methods like NUTS for such problems.
eXpected Goals (xG) is a popular method of answering that age-old question of which team ‘deserved’ to win a match. It does this by assigning a probability of a goal being scored from every opportunity based upon various metrics, such as the distance from goal, number of defenders nearby, and so on. By comparing a team’s actual standings with those from the output of an xG model we get a retrospective measure of how well a team is doing given their chances.
A new version of multistateutils has been released onto CRAN containing a few new features. I’ll give a quick overview of them here, but have a look at the vignette for more examples.
msprep2
The first is a replacement for the mstate::msprep function that converts data into the long transition-specific format required for fitting multi-state models. msprep requires the input data to be in a wide format, where each row corresponds to an individual and each possible state has a column for entry time and a status indicator.
Having become interested in football again due to the World Cup, I was thinking about Predictaball and how I never wrapped up the season with a brief review.
It’s been a big season for Predictaball, with the move to an Elo-based system, as well as the launch of a website. However, is the new match forecasting method any good?
Model accuracy
Fortunately, to help answer this question, a very generous Twitter user by the name of Alex B has been collecting weekly Premiership match predictions from around 30 models and tracking their progress.