# An R package for estimating disease prevalence by simulation: rprev

At ECSG (Epidemiology and Cancer Statistics Group), we primarily work with myeloid and lymphoid disease registries. Through our successful collaborative research project, HMRN (Haematological Malignancy Research Network), we have access to a large observational dataset of haematological malignancies across Yorkshire. From this we can estimate various measures of interest, such as the effect of standard demographic factors (mainly age and sex) on incidence rates, longitudinal incidence trends, and numerous statistics related to survival, for example identifying clinical or demographic factors associated with high risk.

One useful measure that is harder to estimate is disease prevalence, the number of people alive with a specific disease. Prevalence is typically reported as a point estimate at a specific index date, expressed as the number of cases relative to a population size, e.g. 3.5 per 100,000. Prevalence is a function of both incidence and survival, and so can typically be estimated from a combination of patient-level data and national statistics. However, traditional analytical approaches yield unwieldy integrals and equations for calculating standard errors, and so are not ideal for producing interpretable results.

In recent years, owing to significant advances in computational power, statistical computing has become a well-researched discipline. It offers an alternative to traditional analytical estimation: build a stochastic model of the process that generates the data, then sample from this model to generate simulated data. If the model is appropriate, repeated sampling generates representative data sets from which summary statistics, including standard errors, can be readily calculated.
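To make the idea of repeated sampling concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the exponential survival model, the parameter values, and the function names are hypothetical and are not taken from any particular package. It simulates many data sets from a simple model, computes a summary statistic on each, and uses the spread across replicates as the standard error.

```python
import random
import statistics


def simulate_dataset(rng, n_patients, mean_survival_years):
    """Draw one simulated data set of survival times (in years) from the model."""
    return [rng.expovariate(1.0 / mean_survival_years) for _ in range(n_patients)]


def proportion_surviving(survival_times, horizon_years):
    """Summary statistic: fraction of patients surviving beyond the horizon."""
    return sum(t > horizon_years for t in survival_times) / len(survival_times)


def monte_carlo_summary(n_sims=500, n_patients=200,
                        mean_survival_years=4.0, horizon_years=5.0, seed=1):
    """Repeat the simulation and summarise the statistic across replicates."""
    rng = random.Random(seed)
    replicates = [
        proportion_surviving(
            simulate_dataset(rng, n_patients, mean_survival_years), horizon_years
        )
        for _ in range(n_sims)
    ]
    # The mean across replicates is the point estimate; the standard deviation
    # across replicates plays the role of the standard error.
    return statistics.mean(replicates), statistics.stdev(replicates)


estimate, std_error = monte_carlo_summary()
print(f"5-year survival: {estimate:.3f} (SE {std_error:.3f})")
```

No integral has to be solved to obtain the standard error; it falls out of the replicates directly, which is the appeal of the approach.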

In a recent paper by colleagues (Crouch, Simon, et al. "Determining disease prevalence from incidence and survival using simulation techniques." Cancer Epidemiology 38.2 (2014): 193-199), a simulation approach to prevalence estimation was presented. One significant benefit of this technique is that prevalence can be estimated solely from patient-level data, with accompanying confidence intervals, avoiding the complex calculations inherent in an analytical approach. It relies upon having an accurate registry data set of an appropriate duration.

One of the projects I’ve been working on has been to help develop an R package from this algorithm, so that others with disease registry data sets can produce their own prevalence estimates. I’m proud to announce that the first version of rprev is now available on CRAN. This implementation will produce estimates for any suitable data set; however, it is important to consider whether those estimates are reliable before using them to inform practice or sharing them externally. The current version of rprev makes two key assumptions about the disease, both of which must hold for the prevalence estimates to be reliable:

• The incidence process takes the form of a homogeneous Poisson process
• Disease survival can be accurately modelled by a Weibull regression on age and sex
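To show how these two assumptions combine into a prevalence estimate, here is a rough Python sketch (standard library only). It is not rprev's implementation or API: the package works from registry data, whereas this sketch simply plugs in made-up parameters. The incidence rate, the Weibull coefficients, the age distribution, and the population size are all hypothetical values chosen for illustration.

```python
import math
import random
import statistics

# Hypothetical parameters, chosen only for illustration (not fitted to any data).
INCIDENCE_RATE = 50.0       # expected new diagnoses per year
YEARS_BEFORE_INDEX = 20.0   # simulated window, ending at the index date
WEIBULL_SHAPE = 0.9
COEF_INTERCEPT = 2.5        # log-scale coefficients of a Weibull regression
COEF_AGE = -0.02            # per year of age at diagnosis
COEF_MALE = -0.1            # additive effect for male sex


def simulate_diagnosis_times(rng, rate, window):
    """Homogeneous Poisson process: independent exponential gaps between cases."""
    times = []
    t = rng.expovariate(rate)
    while t < window:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def simulate_survival_years(rng, age, is_male):
    """Weibull survival time whose scale depends on age and sex."""
    scale = math.exp(COEF_INTERCEPT + COEF_AGE * age + COEF_MALE * is_male)
    return rng.weibullvariate(scale, WEIBULL_SHAPE)  # (scale, shape)


def simulate_prevalent_count(rng):
    """Count simulated cases still alive at the index date."""
    alive = 0
    for diagnosed_at in simulate_diagnosis_times(rng, INCIDENCE_RATE, YEARS_BEFORE_INDEX):
        age = min(max(rng.gauss(65.0, 12.0), 18.0), 100.0)  # hypothetical ages
        is_male = 1 if rng.random() < 0.5 else 0
        if diagnosed_at + simulate_survival_years(rng, age, is_male) > YEARS_BEFORE_INDEX:
            alive += 1
    return alive


def estimate_prevalence(n_sims=200, seed=42):
    """Mean and spread of the prevalent count across simulated replicates."""
    rng = random.Random(seed)
    counts = [simulate_prevalent_count(rng) for _ in range(n_sims)]
    return statistics.mean(counts), statistics.stdev(counts)


mean_count, sd_count = estimate_prevalence()
POPULATION = 3_600_000  # hypothetical catchment population
print(f"Prevalent cases: {mean_count:.1f} (SD {sd_count:.1f})")
print(f"Rate per 100,000: {mean_count / POPULATION * 100_000:.2f}")
```

If either assumption fails, for example if incidence is rising over time or survival does not follow a Weibull form, the simulated counts will be systematically off, which is why both need checking against the data.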

We provide diagnostic functions for assessing the validity of these assumptions, and in future releases we will allow the user to specify the incidence process and survival model. Furthermore, the simulation process relies upon having an appropriate sample data set which accurately reflects the population under consideration.
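As an illustration of what checking the incidence assumption can look like, here is one simple, generic check sketched in Python. It is not one of rprev's diagnostic functions, and the yearly counts are invented for the example. For a homogeneous Poisson process, yearly case counts should have a variance roughly equal to their mean, so the ratio of the two (the index of dispersion) should sit near 1.

```python
import statistics


def dispersion_index(yearly_counts):
    """Sample variance divided by the mean; near 1 for Poisson-like counts."""
    return statistics.variance(yearly_counts) / statistics.mean(yearly_counts)


# Hypothetical yearly incidence counts, invented for illustration.
steady_counts = [44, 57, 51, 39, 62, 48, 53, 46, 58, 42]
trending_counts = [30, 35, 41, 46, 52, 58, 63, 69, 75, 81]

index_steady = dispersion_index(steady_counts)
index_trending = dispersion_index(trending_counts)

print(f"Steady series:   {index_steady:.2f}")
print(f"Trending series: {index_trending:.2f}")
```

A ratio well above 1, as in the trending series, suggests the rate is changing over time or the counts are overdispersed, in which case estimates that rely on a constant incidence rate should be treated with caution. With only a handful of years the ratio is noisy, so this is a rough screen and not a formal test.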

We have provided a user guide detailing how to use the package; for most people the prevalence function will be the most useful, as it does the heavy lifting. However, we strongly recommend reading the user guide first, along with the function documentation, so that the resulting estimates are as accurate and reliable as possible.

Any feedback is very welcome for the next release of rprev; see the package’s CRAN page for contact details.