As a modern statistician, one of the most fundamental contributions you can make is to create and distribute software that implements the methods you develop. I have gone so far as to say that if you write a paper without software, that paper doesn't exist.
The purposes of this guide are:
- To explain why writing software is a critical component of being a statistician.
- To give you an introduction to the process/timing of creating an R package.
- To help you figure out how to distribute/publicize your software.
- To remind you that "the perfect is the enemy of the very good".
- To try to make sure Leek group software has a consistent design.1
Stay tuned for the Leek group approach to building/sharing data analyses.
Cause you know, you do what your advisor says and stuff.
But there are some real reasons to write software as a statistician that I think are critically important:
- You probably got into statistics to have an impact. If you work with Jeff it was probably to have an impact on human health or statistics. Either way, one of the most direct and quantifiable ways to have an impact on the world is to write software that other scientists, educators, and statisticians use. If you write a stats method paper with no software the chance of impacting the world is dramatically reduced.
- Software is the new publication. I couldn't name one paper written by a graduate student (other than mine) in the last 2-3 years. But I could tell you about tons of software packages written by students/postdocs (both mine and at other places) that I use. It is the number one way to get your name out there in the statistics community.
- If you write a software package you need for yourself, you will save yourself tons of time in sourcing scripts, and remembering where all your code/functions are.
Perhaps most importantly, creating an R package is building something. It is something you can point to and say, "I made that". Leaving aside all the tangible benefits to your career, the profession, etc., it is maybe the most gratifying feeling you get when working on research.
As soon as you have 2 functions.

Why 2? After you have more than one function, it starts to get easy to lose track of what your functions do, and it starts to be tempting to name your functions foo or tempfunction or some other such nonsense. You are also tempted to put all of the functions in one file and just source it. That is what I did with my first project, which ended up being an epically comical set of about 3,000 lines of code in one R file. Ask my advisor about it sometime; he is probably still laughing about it.
To start writing an R package you need:
- R - I mean, unless you are a wizard.
- Your functions, each written in a separate file2.
- At a minimum, the following packages installed: devtools, roxygen2 (suggested by devtools), and knitr.
- If you are going to use C code: Rcpp.
- Git installed.
- A GitHub account.
- To read Hadley's intro.
- To read Karl's GitHub tutorial.
The first step in creating your R package is to give it a name. Hadley has some ideas about it. Here are our rules:
- Make it googleable - check by googling it.
- Make sure there is no Bioconductor/CRAN package with the same name.
- No underscores, dashes, or any other special characters/numbers.
- Make it all lower case - people hate having to figure out caps in names of packages.
- Make it memorable; if you want serious people to use it don't be too cute.
- Make it as short as you possibly can while staying googleable.
- Never, under any circumstances, let Rafa or Hector name your package.3
Almost all of our packages will eventually go on Bioconductor. So we are going to use the versioning scheme that is compatible with that platform (with some helpful suggestions from Kasper H.).
The format of the version number will always be x.y.z. When you start any new package the version number should be 0.1.0. Every time you make any change public (e.g., push to GitHub) you should increase z in the version number. If you are making local commits but not making them public to other people, you don't need to increase z. You should stay in version 0.1.z basically up until you are ready to submit to Bioconductor (or CRAN) for release. Before release you can increase y if you perform a major redesign of how the functions are organized or are used. You should never increase x before release.
The first time you submit the package to Bioconductor you should submit it as version number 0.99.z. That way, on the next release of Bioconductor it will get bumped to 1.0.0, and the next devel version will get bumped to 1.1.0 on Bioconductor. Immediately after releasing, if you plan to keep working on the project, you should bump your version on GitHub to 1.1.0. Thereafter, again you should keep increasing z every time you make a public change, and if you do a major reorganization you should increase y.
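To make the versioning scheme concrete, here is a hypothetical version history for a package (the events are made up for illustration):

```
0.1.0   new package, first public push to GitHub
0.1.1   bug fix pushed to GitHub (bump z on every public change)
0.2.0   major pre-release redesign (bump y)
0.99.0  version submitted to Bioconductor
1.0.0   first Bioconductor release (bumped at the Bioconductor release)
1.1.0   Bioconductor devel version; also your GitHub version after release
1.1.1   next public change (keep bumping z)
```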
Run this code from R to create your package. It will create a directory called packagename and put some stuff in it (more on this stuff in a second).
```r
## Setup
install.packages(c("devtools", "roxygen2", "knitr"))

## Load the libraries
library("devtools")
library("roxygen2")
library("knitr")

## Create the package directory
create("packagename")
```
Put your package on GitHub
All packages that are developed by the Leek group will be hosted on GitHub. The accounts are free and it makes it so much easier to share code/get other people to help you with your code. Here are the steps to getting your package on GitHub:
- Create a new GitHub repository with the same name (packagename).
- In the packagename directory on your local machine, run the command: git init
- Then run: git remote add origin git@github.com:yourusername/packagename.git
- Create a file in the packagename directory called README.md.
- Run the command: git add *
- Run the command: git commit -m 'initial commit'
- Run the command: git push -u origin master
In summary:
```shell
mkdir packagename
cd packagename
git init
git remote add origin git@github.com:yourusername/packagename.git
git add *
git commit -m 'initial commit'
git push -u origin master
```
- Use commit messages that will help you remember what you did and why you did it.
- If you interact very frequently with GitHub, you will be interested in setting up SSH keys to avoid typing your password every time you push/pull.
- You can mark specific versions of your package using Git Tags which allows you to easily check the state of the package at that particular version.
- If more than one person is working on developing the package, or you want to contribute to one, check out how to fork a repository. It is an easy way to contribute with a very low burden on the maintainer and no setup.
- Consider whether you want users to report issues for your package on GitHub. It is a very good framework for issue management, but it can lead to duplicate information if the main issue reporting/tracking system is a mailing list, as is the case for Bioconductor packages. For an example of what GitHub's issue system looks like, check the rCharts issues.
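For instance, Git tags let you mark the exact state of the package at a release so you can return to it later. The commands below sketch the idea; the throwaway repository setup is only there to make the example self-contained, and the tag name is just an example - in practice you would tag inside your package's existing repository.

```shell
# Demo setup: a disposable repository so the example runs anywhere
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "Package: packagename" > DESCRIPTION
git add DESCRIPTION
git commit -q -m 'initial commit'

# Tag the current commit with an annotated version tag, then list tags
git tag -a v0.99.0 -m 'version submitted to Bioconductor'
git tag
```

Later, git checkout v0.99.0 restores the package exactly as it was at that version.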
Once you're familiar with basic git and GitHub workflows, GitHub has some more advanced features that you can take advantage of. In particular, GitHub flow is an excellent way to manage contributions, and GitHub organizations can provide a central location for multiple people (e.g. in a lab) to collaborate on software.
The R functions you have written go in the R/ directory in the packagename folder. Each of your R functions should go in a separate file with a .R extension. We are going to use capital R for the extension of the files. Why? Don't ask questions.
If you define a new class, call the .R file classname-class.R. For example, if you are creating the leek class of objects, it would be called leek-class.R. If you are defining a new method for the class, it should be named classname-methodname-method.R. For example, a plotting method for the leek class would go in a .R file called leek-plot-method.R.
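Putting those conventions together, the R/ directory of a hypothetical package defining a leek class might look like this (file names are illustrative):

```
R/
├── leek-class.R        # definition of the leek class
├── leek-plot-method.R  # plot method for the leek class
└── myfunction.R        # an ordinary function in its own file
```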
The DESCRIPTION file is a plain text file that gets generated by the devtools::create command.
- The package name should go after the colon on the first line.
- The package title should be a one sentence description of what the package actually does.
- The description should be a one paragraph description that builds on the title. It should give a user some idea about what kind of data your software should be used on, what the inputs are and what the outputs are.
- The version should be defined as described above.
- The authors field may have a @R before the colon, which should be deleted. The authors should be in the format author name <authoremail@host.com>, for example: Jeff Leek <jleek@jhsph.edu>, and should be comma separated.
- A maintainer field should be added with maintainers listed in a comma-separated list. You are the maintainer of your package when you create it. See the section below on what happens after you leave the Leek group for more information.
- The dependencies (other R packages your software uses/depends on) should be listed in a comma-separated list after the R version. One of the dependencies should be the knitr package for the vignette.
- The License is required to be open source. I like GPL-2 or GPL-3. I like the Creative Commons licenses, like CC-BY-SA, for manuscripts, but they are not recommended for software. This is a good website for learning more about software licenses. Also see Jeff Atwood's comments on licenses.
- You should add a line VignetteBuilder: knitr
- You should add a line Suggests: knitr, BiocStyle
For example:
```
Package: packagename
Type: Package
Title: A sentence
Version: 0.1.0
Date: 2013-09-30
Authors@R: c(person("Jeff", "Leek", role = c("aut", "cre", "ths"),
    email = "jleek@jhsph.edu"))
Depends:
    R (>= 3.0.2),
    knitr
Suggests:
    knitr,
    BiocStyle
Description: A couple sentences that expand the title
License: Artistic-2.0
```
I will try to keep the stylistic requirements minimal because they would likely drive you nuts. For now there are only two:
- Your indent should be 4 spaces on every line.
- Each line can be no more than 80 columns.

You can set these as defaults (say in Sublime or RStudio), and then you don't have to worry about them anymore. If you find lines going longer than 80 columns, you will need to break them up, for example by moving some of the code into separate functions.
This is how I feel about the relative importance of various components of statistical software development:
Ideally your software is easy to understand and just works. But this isn't Apple and you don't have a legion of test users to try it out for you. So more likely than not, at least the first several versions of your software will be at least a little hard to use. The first versions will also probably be slower than you would like them to be.
But if your software solves a real problem (it should!) and is well documented (it will be!) then people will use it and you will have a positive impact on the world.
Documentation has two main components. The first component is help files, which go into the man/ folder. The second component is vignettes which will go in a folder called inst/doc/ which you will have to create. I'll tackle each of these separately.
These files document each of the functions/methods/classes you will expose to your users. The good news is that you don't have to write these files yourself. You will use the roxygen2 package to create the man files. To use roxygen2 you will need to document your functions in the .R files with comments formatted in a specific way. Right before each function you should have a set of comments that are denoted by the symbol #'. They are structured in the following way:
```r
#' A one sentence description of what your function does
#'
#' A more detailed description of what the function is and how
#' it works. It may be a paragraph that should not be separated
#' by any spaces.
#'
#' @param inputParameter1 A description of the input parameter \code{inputParameter1}
#' @param inputParameter2 A description of the input parameter \code{inputParameter2}
#'
#' @return output A description of the object the function outputs
#'
#' @keywords keywords
#'
#' @export
#'
#' @examples
#' R code here showing how your function works
myfunction <- function(inputParameter1, inputParameter2){
    ## Awesome code!
    return(result)
}
```
You include the @export command if you want the function to be exported (i.e., visible) to your end users. Hadley has a pretty comprehensive guide where you can learn a lot more about how roxygen2 works. Your function follows immediately after the comments.
When you have saved functions with roxygen2-style comments, you can create the .Rd files (the man files themselves) by running:

```r
library("devtools")
document("packagename")
```

The package folder must be in the current working directory where you are editing.
Please read Hadley's guide in its entirety to understand how to document packages and in particular, how roxygen2 deals with collation and namespaces.
Documentation in the help files is important and is the primary way that people will figure out your functions if they get stuck. But it is equally (maybe more) critical that you help people get started. The way that you do that is to create a vignette. For our purposes, a vignette is a tutorial that includes the following components:
- A short introduction that explains
- The type of data the package can be used on
- The general purpose of the functions in the package
- One or more example analyses with
- A small, real data set
- An explanation of the key functions
- An application of these functions to the data
- A description of the output and how it can be used
We will write vignettes in knitr. Vignettes can generate either HTML from R Markdown or PDF from LaTeX. In either case, the files should be in packagename/vignettes/. During package building they will be moved to packagename/inst/doc. For HTML vignettes, the vignette file should be vignette.Rmd; for PDF vignettes, it should be vignette.Rnw. Here is some information from Yihui about building vignettes in knitr.
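As a sketch, the metadata block at the top of an HTML vignette.Rmd might look like this (the title and index entry are placeholders; the %\VignetteEngine line tells R to build the vignette with knitr):

```
---
title: "Introduction to packagename"
output: html_document
vignette: >
  %\VignetteIndexEntry{Introduction to packagename}
  %\VignetteEngine{knitr::rmarkdown}
---
```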
For LaTeX vignettes, you should use the BiocStyle package to style your vignette. This means you will need to add this code to the preamble of your Rnw file:

```
<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@
```
See the BiocStyle vignette for commands that you can use when creating your vignette (e.g. \Biocpkg{IRanges} for referencing a Bioconductor package).
For our purposes anyone who wrote a function exposed to users in the package (a function that has an @export in the documentation) will be listed as an author.
You will be a maintainer and Jeff will be a maintainer. If you can sucker one of your fellow students into maintaining the package as well, you can list them, but they must make the same commitment to 5 years of support.
One thing I think a lot of academics (definitely including myself here) struggle with is the need to be "new" and "innovative" with everything they do. There is a strong selective pressure for these qualities in academia.
But when writing software it is very, very important not to reinvent every single wheel you see. One person who is awesome at blending existing tools and exponentially building value is Ramnath Vaidyanathan. He built slidify on top of knitr and rCharts on top of existing D3 libraries. They allowed him to create awesome software without having to solve every single problem.
Before writing general purpose functions (say for regression or for calculating p-values or for making plots) make sure you search for functions that already exist (or ask Jeff/your fellow students if you aren't sure).
An important and balancing principle is that you should try to keep the number of dependencies for your package as small as possible. You should also try to use packages that you trust will be maintained. Some ways to tell if a package is "trustworthy" are to check the number of downloads/users (higher is better), check to see if the package is being actively updated (on GitHub/Bioconductor/CRAN) and there is a history of updates, and check to see if the authors of the packages routinely maintain important packages (like Hadley, Yihui, Ramnath, Martin Morgan, etc.). Keep in mind your commitment (see below) when making decisions about whose functions to use.
A major temptation for everyone creating code is to generate a bunch of features they think that users will want without actually testing those features out. The problem is that each new feature you create in your package will monotonically increase the number of dependencies and the amount of code you have to maintain. In general, the principle should be to create exactly enough functions that the users can install your package, perform your analysis, and return the results, with no extraneous functionality.
Specifically, be wary of things like GUIs or Shiny apps. Given the heavy emphasis placed on reproducibility these days, it is rarely the case that real/important analyses will be performed in a point and click format.
If you are way into creating products that point-and-click users will be interested in, I'm very happy to support you in that, since I think those things are cool, probably the future, and can certainly raise interest in your work. But they present a potentially major maintenance difficulty and should be placed in separate packages on your own account.
Unit tests are important for the following reasons:
- They make it easier for you to find bugs in your code
- They make it easier for you to figure out if your code works together
- They make you slow down and think about what you are doing
Your advisor tends to rush through things. While there is a strong interest in being competitive/getting things done in the Leek group, there is an even stronger interest in doing them right and for the long term. We will use the testthat framework for building unit tests. A couple of suggestions for using the framework are:
- Organize your tests into contexts (groups of tests) and name them.
- Have some tests that test the basic helper functions on fixed data.
- Have some tests that check the output of your main functions on fixed data.
- Have some tests that can be run on random data - these can help find bugs before they happen
To use the testthat package you will put a file called test-area-packagename.R in the inst/tests directory, where area is the name of the broad class of functions you are testing. Then add another file called run-all.R in the same directory. The run-all.R file has this code in it:
```r
library(testthat)
library(mypackage)
test_package("mypackage")
```
This means that whenever you run R CMD check on your package, you will get an error if one of the unit tests fails. This is important, because it will be one way for me to check your code and to tell you if you have broken your code with an update.
Here is an example unit test, so you get the idea of what I'm looking for. Below is a function for performing differential expression analysis on a matrix.
```r
#' A function for differential expression analysis
#'
#' This function takes a matrix of gene expression data
#' (genes in rows, samples in columns) and a factor variable
#' with two levels and performs differential expression analysis.
#'
#' @param dat A gene expression data matrix with genes in rows and samples in columns
#' @param grp A factor variable with two levels
#'
#' @return pValues The p-values from the differential expression test
#'
#' @keywords differential expression
#'
#' @export
#'
#' @examples
#' R code here showing how your function works
deFunction <- function(dat, grp){
    if(!is.factor(grp)){stop("grp variable must be a factor")}
    if(length(unique(grp)) != 2){stop("grp variable must have exactly two levels")}
    if(any(genefilter::rowSds(dat) == 0)){stop("some genes have zero variance")}
    result <- genefilter::rowttests(dat, grp)$p.value
    return(result)
}
```
Now create a new file called test-defunction.R. The code in that file is:
```r
context("tests on inputs")

test_that("tests for grp variable", {
    set.seed(12345)
    dat <- matrix(rnorm(100 * 30), nrow = 100, ncol = 30)
    grp <- rep(c(0, 1), each = 15)
    expect_that(deFunction(dat, grp), throws_error("grp variable must be a factor"))
    grp <- as.factor(rep(c(0, 1, 2), each = 10))
    expect_that(deFunction(dat, grp), throws_error("grp variable must have exactly two levels"))
    grp <- as.factor(rep(0, 30))
    expect_that(deFunction(dat, grp), throws_error("grp variable must have exactly two levels"))
})

test_that("tests for dat variable", {
    set.seed(12345)
    grp <- as.factor(rep(c(0, 1), each = 15))
    dat <- matrix(0, nrow = 100, ncol = 30)
    expect_that(deFunction(dat, grp), throws_error("some genes have zero variance"))
})

context("test on outputs")

test_that("test p-values are numeric and non-zero", {
    set.seed(12345)
    grp <- as.factor(rep(c(0, 1), each = 15))
    dat <- matrix(rnorm(100 * 30), nrow = 100, ncol = 30)
    expect_that(deFunction(dat, grp), is_a("numeric"))
    expect_that(all(deFunction(dat, grp) > 0), is_true())
})
```
If you run the command test_file("test-defunction.R") you should get the output:

```
tests on inputs : ....
test on outputs : ..
```
without any error messages. But if you accidentally delete one of your error checking messages at the beginning of the function and rerun the tests, it will tell you which context and which test broke.
The thing to keep in mind here is that you would be doing these tests on the fly/manually anyway. So you might as well write them into a set of unit tests that will give you an idea of where your functions are breaking. Some things that you should be testing:
- That your error messages catch what they should
- That your outputs are what you expect them to be
- That you get reasonable answers for reasonable inputs
- When you know what the answer should be - you get it
Hopefully your package will have a ton of users. Inevitably, they will try to use the software for purposes you did not intend. Some of these will be happy things (software you wrote for microarrays being used in sequencing). Sometimes they will be unhappy - people using it completely out of context and getting wonky answers.
A major component of dummy proofing is writing thorough documentation and vignettes (see the above sections). When you see weird cases (reported from users or cases you find yourself) add them to the documents. There is one additional important component of dummy proofing we will use: input dummy proofing (related to unit testing).
Most of the time, if you input non-matching arguments or arguments of the wrong class into R functions, a cryptic error will result. This will cause you headaches when maintaining your software. So you should add some commands with the stop function at the beginning of each function that throw errors if the arguments don't have the correct class or would result in silly output (like the zero variance gene test in the unit testing example above).
Releasing to Bioconductor
Once you have developed a package you should use it yourself for a couple of weeks. Then you should have at least one other student use it without you giving them any instructions other than telling them where some data are - that way you can test your documentation. Finally, you should meet with Jeff and get ready to release it to Bioconductor.
The steps in releasing to Bioconductor are:
- Follow Bioconductor's checklist to confirm the package is ready to upload.
- In particular, check that your package passes R CMD check in less than 5 minutes. You can use:

```r
library("devtools")
check_doc("packagename")           ## Only for checking the documentation
system.time(check("packagename"))  ## R CMD check with time information
```
- Update the version number and push to GitHub. In the commit comments, state it is the version being pushed to Bioconductor.
- Send an email as described in the checklist stating that you want an account and want to submit a package.
- Submit the package to Bioconductor.
- Update the version number (bump y and z) so that z is the next odd number. In the commit comments, state that this is the new devel version.
You are invested in creating the software you are working on now, since it is part of your training and will definitely help your short-term career in terms of getting jobs/gaining visibility. But your time as a graduate student is (happily for you) limited. After a few years you will graduate and head off to an awesome job. The software you built as a student may be less important to you.
Hopefully, though, it is still very important to the users who are applying that software to solve the most pressing problems in molecular biology and medicine. It is unfair to those users if your software breaks down and can't be installed/used.
So if you undertake a research project in the Leek group which involves software development (pretty much all of them will) you commit to maintaining that software for at least 5 years after you graduate. Of course, I can't hold you to that like a contract or anything, but think of it as an honor thing.
5 years is a long time. It is most of the way toward tenure (in academia), or probably 3 years after you have started your own awesome company and appeared on TechCrunch. So it is worth thinking about ways you can ensure that the maintenance burden will be as low as possible. Specifically:
- Make the dependencies as minimal as possible. If your dependencies update, you'll have to update the software.
- Only create functions that are absolutely necessary for the package. It is hard to delete functions from a package after it is released without messing with users, and each one you keep in adds to the maintenance headache.
- Make the vignette really clear and add a FAQ with questions you get from users while you are still in the Leek group.
- Add comments to your code/unit tests so that when something breaks you can find/fix the problem quickly.
The first version of this tutorial was written by Jeff Leek (@simplystats). Hopefully he can sucker his students into contributing, since they are much, much better R programmers than he is.

Thanks to the beauty of GitHub, a few others have made contributions:
- Titus Brown added the GitHub flow and organization links. You can see his group's setup at ged-lab.
- Karl Broman fixed some problems with license recommendations.
- Leo Collado Torres fixed some typos/added links.
- Alyssa Frazee fixed typos/corrected grammar.
- Robert M Flight corrected some information about writing vignettes.
1. These design requirements are subject to update and may not reflect Leek group software created before 9/18/2013 (or ever really; remember, the perfect is the enemy of the very good).
2. Except for utility functions; more on this later.
3. Ask me about the time they named one of my packages "succs".
4. With proper attribution, of course.