README.Rmd

---
title: R package doseminer
author: David Selby and Belay Birlie
output:
  github_document:
    df_print: kable
---

<!-- badges: start -->
<!-- badges: end -->

An R implementation of the text mining algorithm of [Karystianis et al. (2015)](https://doi.org/10.1186/s12911-016-0255-x) for extracting
drug dosage information from electronic prescription data (especially from CPRD).
The aim of this project is to provide a complete replacement for the algorithm,
entirely written in R with no external dependencies (unlike the original implementation, which depended on Python and Java).
This should make the tool more portable, extensible and suitable for use across
different platforms (Windows, Mac, Unix).

## Installation

You can install **doseminer** from CRAN using
```r
install.packages('doseminer')
```
or get the latest development version via GitHub:
```r
# install.packages('remotes')
remotes::install_github('Selbosh/doseminer')
```

## Usage

The workhorse function is called `extract_from_prescription`.
Pass it a character vector of freetext prescriptions and it will try to extract
the following variables:

- Dose frequency (the number of times per day a dose is administered)
- Dose interval (the number of days between doses)
- Dose unit (how individual doses are measured, e.g. millilitres, tablets)
- Dose number (how many of those units comprise a single dose, e.g. 2 tablets)
- Optional (should the dose only be taken 'if required' / 'as needed'?)

```{r}
library(doseminer)
extract_from_prescription('take two and a half tablets every two to three days as needed')
```

Anything not matched is returned as `NA`, though some inferences are also made.
For instance: if a dosage is specified as multiple times per day, with no
explicit interval between days, it's inferred the interval is one day.
Similarly, if an interval is specified (e.g. every 3 days) but not a daily
frequency, it's presumed the dose is taken only once during the day.

To see the package in action, a small vector of example prescriptions is
included in the variable `example_prescriptions`.

```{r}
extract_from_prescription(example_prescriptions)
```

The column `output` represents the 'residual' text after other features have been extracted.
It can be ignored for most applications, but is useful for debugging prescriptions that have not been parsed as expected.

## English words to numbers

Built into this package is a series of functions for extracting and parsing
natural language English numbers into their digit-based numeric form. This
could be spun out into its own package for more general use.

```{r}
replace_numbers(c('Thirty seven bottles of beer on the wall',
                  'Take one down, pass it around',
                  'Thirty-six bottles of beer on the wall!',
                  'One MILLION dollars.',
                  'We do not take any half measures'))
```

Inspired by Ben Marwick's `words2number` (https://github.com/benmarwick/words2number).

## Contributors

Maintained by David Selby (`david.selby@manchester.ac.uk`) and Belay Birlie.

## References

Karystianis, G., Sheppard, T., Dixon, W.G. *et al.*
Modelling and extraction of variability in free-text medication prescriptions from an anonymised primary care electronic medical record research database.
*BMC Med Inform Decis Mak* **16**, 18 (2015).  
https://doi.org/10.1186/s12911-016-0255-x