An R package for estimating disease prevalence from registry data

Camel Up is a deceptively simple board game in which the aim is to predict the outcome of a camel race. I’ll quickly try to explain the game now, although it’s always hard to explain a board game without an actual demonstration.
Camel movement is determined by dice rolls as follows. Five dice, one in the colour of each of the five camels and each labelled with the numbers 1-3 twice, are placed into a container (decorated as a pyramid, since the game is set in Egypt), which is then shaken.
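
The dice mechanism can be sketched in a few lines of R. This is only an illustration: the camel colours, the function name `simulate_leg`, and the assumption that every die leaves the pyramid exactly once in a random order are mine, since the description above doesn't spell out how the dice are drawn.

```r
# Illustrative colour labels; the text only says there are five camels.
camels <- c("blue", "green", "orange", "yellow", "white")

# Each die carries the numbers 1-3 twice, so every value is equally likely.
die_faces <- rep(1:3, times = 2)

# Simulate one shake of the pyramid. Assumption: each die comes out exactly
# once, in a random order, showing one randomly chosen face.
simulate_leg <- function(camels, faces = die_faces) {
  draw_order <- sample(camels)
  rolls <- sample(faces, size = length(camels), replace = TRUE)
  data.frame(camel = draw_order, spaces = rolls, stringsAsFactors = FALSE)
}

set.seed(1)
leg <- simulate_leg(camels)
print(leg)
```

Each row gives a camel and the number of spaces it would move, in the order the dice were drawn.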

In the last couple of months I’ve been teaching myself about multi-state survival models for use in an upcoming project. While I found the theoretical concepts relatively straightforward, I ran into issues when I began implementing the models in software. There are many considerations to be made when building a multi-state model, such as:
- Converting the data into a suitable long format
- Deciding whether to use either parametric or semi-parametric models
- Different subsets of the available covariates can be selected for each of the transition hazards
- In addition, covariates can be forced to have the same hazard ratio on every transition
- There’s a choice to be made between clock-forward or clock-reset (semi-Markov models) time-scales
- The Markov assumption can be further violated by including the state arrival times as part of the transition hazard; this often has theoretical justification
- The baseline hazards can be kept stratified by transition, or certain ones can be assumed to be proportional

Needless to say, actually building a model was very time-consuming.
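
To make the long-format step concrete, here is a minimal base R sketch for a three-state illness-death model (healthy, ill, dead). Everything in it is an assumption for illustration: the toy data, the column names, the transition numbering, and the helper `to_long`. It is not the approach taken in the post itself, and dedicated packages handle this step far more generally.

```r
# Toy wide-format data: one row per subject (hypothetical values).
wide <- data.frame(
  id           = 1:3,
  ill_time     = c(2, NA, 4),  # time of illness, NA if never ill
  ill_status   = c(1, 0, 1),   # 1 = became ill
  death_time   = c(5, 3, 6),   # time of death or end of follow-up
  death_status = c(1, 1, 0)    # 1 = died, 0 = censored
)

# One row per subject per transition they are at risk of:
#   1: healthy -> ill, 2: healthy -> dead, 3: ill -> dead
# Assumes at least one subject became ill.
to_long <- function(wide) {
  became_ill <- wide$ill_status == 1
  # Time at which each subject leaves (or is censored in) the healthy state
  exit_healthy <- ifelse(became_ill, wide$ill_time, wide$death_time)

  healthy_to_ill <- data.frame(
    id = wide$id, trans = 1, Tstart = 0, Tstop = exit_healthy,
    status = as.integer(became_ill)
  )
  healthy_to_dead <- data.frame(
    id = wide$id, trans = 2, Tstart = 0, Tstop = exit_healthy,
    status = as.integer(!became_ill & wide$death_status == 1)
  )
  ill <- wide[became_ill, ]
  ill_to_dead <- data.frame(
    id = ill$id, trans = 3, Tstart = ill$ill_time, Tstop = ill$death_time,
    status = as.integer(ill$death_status == 1)
  )

  long <- rbind(healthy_to_ill, healthy_to_dead, ill_to_dead)
  long <- long[order(long$id, long$trans), ]
  rownames(long) <- NULL
  long
}

long <- to_long(wide)
# Clock-reset time-scale: time since entering the current state
long$time <- long$Tstop - long$Tstart
print(long)
```

The `Tstart` and `Tstop` columns correspond to the clock-forward time-scale, while `time` restarts at zero on entry to each state, which is the clock-reset alternative.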

I recently gave a talk at my university’s R User group on how to publish packages to CRAN (slides here). This isn’t an easy topic to distill into a 60-minute slot, and so I had to abandon my original idea of a hands-on workshop with examples in favour of a condensed summary of the main challenges in the submission process. This mostly focused on the issue of namespaces, since this is a rather complex topic to understand if you’re coming from a non-software engineering background, as it doesn’t come up in day-to-day statistical analysis.

At ECSG (Epidemiology and Cancer Statistics Group), we primarily work with myeloid and lymphoid disease registries. Through our successful collaborative research project, HMRN (Haematological Malignancy Research Network), we have access to a large observational dataset of haematological malignancies across Yorkshire. From this we can estimate various measures of interest, such as the effect of standard demographic factors (mainly age and sex) on incidence rates and any longitudinal incidence trends, in addition to numerous statistics related to survival, for example any clinical or demographic factors associated with a high level of risk.

I’ve been working on another paper today and decided to update my previous xtable function (as described here) to use dplyr, as I want to fully get to grips with Hadley Wickham’s wonderful ecosystem of packages including dplyr (and its predecessor plyr), ggplot2 and tidyr (and its predecessor reshape2). I mentioned this before Christmas but have only got round to it now, which included a few hours of struggling with tidyr to make it do what I want!

I’ve recently decided to start using Sweave for producing my publications since I already use R for the data analysis side and LaTeX for the markup, so it seems natural to combine them. In a nutshell, Sweave lets you embed R output directly into your documents, allowing for a more organised workflow. You mark a section as containing R code and run your analyses, and the output, be it text, a table, or a chart, is formatted directly into LaTeX markup.

I’ve never fully taught myself R, just dipped in and out when necessary. I’ve primarily used it for standard data analysis and visualisation, although I have been meaning to get to grips with one of the numerous available machine learning packages. Dealing with datasets tends to involve a lot of hacky manipulation until the data is in a useful format for your analysis. Initially I was just trying to use standard library functions, but once I came upon the essential reshape2 package, and saw the ease with which you could convert your dataframe between wide and long formats, I knew I was going to have to use a different approach.
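
As a small illustration of the wide-to-long conversion mentioned above, the sketch below uses `melt()` and `dcast()` from reshape2 on a made-up data frame. The data and the column names (`id`, `visit`, `score`) are hypothetical, and the reshape2 package is assumed to be installed.

```r
library(reshape2)

# Hypothetical wide data: one row per subject, one column per visit.
wide <- data.frame(
  id     = 1:3,
  visit1 = c(5.1, 4.8, 6.0),
  visit2 = c(5.4, 4.9, 6.3)
)

# Wide -> long: one row per subject per visit.
long <- melt(wide, id.vars = "id",
             variable.name = "visit", value.name = "score")

# Long -> wide again: one column per level of `visit`.
back <- dcast(long, id ~ visit, value.var = "score")

print(long)
print(back)
```

The long form is usually what plotting and modelling functions expect, while the wide form is often easier to read as a table.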

© 2019 Stuart Lacy