
Reproducibility in R


Reproducibility in R Workshop

9/6 Git Bash and Terminal Work

In this workshop we worked on setting up Git Bash in RStudio and learning to navigate the command line in the Terminal pane within RStudio. One outcome from this workshop was learning how to navigate different files and folders and determining which folders on my own computer I might use most often. I expect that my most frequent folders will be:

  1. c/users/erika/Documents/R for learning R commands and for documenting scripts
  2. c/users/erika/Documents/Work for data and files related to work tasks and projects
  3. c/users/erika/Documents/Dissertation Work for data and files related to my thesis work

Shell commands are really helpful for navigating folders from the terminal. Here are some important commands I used:

  • pwd: command to print your current working directory
  • cd: change directory/folder
  • ls: list the files and folders in your current directory
  • Commands have optional arguments that you can add to the command with a dash
  • Find manual page for commands by doing [command] --help or man [command] (exit man by hitting q)
  • Download the shell-lesson-data.zip file and move it to your Desktop folder
  • ls [folder] will list contents of a subfolder from your current location
  • Unzip it with cd Desktop then unzip shell-lesson-data.zip
  • File and folder names should have no spaces in them; use dashes or underscores instead
  • Can use cd multiple times to move further into subfolders or put together a path like cd Desktop/project
  • Absolute paths start with a / and work from anywhere on your computer; relative paths depend on where you are currently located
  • An absolute path is like the address for a building, while relative paths are like directions from where you’re at to another location
  • Move up one folder level using cd ..
  • . is shorthand for your current folder and .. is shorthand for the folder above your current folder

9/8 Learning Git for Version Control

For secure data, you can store the files on a local machine that all of the users can access (or in BoxHealth), use Git and GitHub to store only the code, and then point the code to the data with a file path. In general, GitHub is not a data repository because repositories have size limits. In Git there are three parts to saving work: saving your files, staging the changes in a staging area, and then committing the staged changes to the repository.

Here are some helpful commands I learned and notes for this session:

  • When you start git for the first time, you need to configure your git username and email (only need to do this once per computer): git config --global user.name "[name]", git config --global user.email "[GitHub email]"
  • Check current git config settings with git config --list
  • One main folder per project, which we’ll turn into a git repository now and later an RStudio project
  • Create new folder with mkdir [directory name]
  • From within a project folder, turn it into a git repo using git init
  • Hidden files have names that begin with a period, can see hidden files using ls -a
  • Can see info about your git repo using git status
  • History of moving from master to main terminology: https://www.jumpingrivers.com/blog/git-moving-master-to-main/
  • Move files to the stage with git add [file name]
  • Commit the staged files with git commit -m "[message]"
  • Each commit represents a chunk of work, or a new version of your files
  • Create a new R script, save it (Jessica recommends using a numbered system for file naming!), and do git status to see what has changed in repo
  • git add 01-clean-soil-data.R and git commit -m "Initialize cleaning script"
  • Commit messages usually start with a present tense verb
  • You can add multiple files with git add to include multiple files in a particular commit
  • Can get list of commits using git log
  • To compare against an earlier version, use git log to find the commit's hash (the long ID), then git diff [hash] to see what has changed since that commit

9/13 Sharing with GitHub

We learned the difference between Git (which we learned last time) and GitHub. GitHub is important for sharing projects publicly and collaborating with others, and it displays commits and history much more clearly than the log in Git. We followed this tutorial from Software Carpentry.

Here are some helpful commands I learned and notes for this session:

  • Moving files with mv, first argument is path to file to be moved, second argument is path to where it should be moved to
  • Create .gitignore as a text file, type paths/names of files you do NOT want git to track
  • Must be named .gitignore
  • Should be located in the main repository folder
  • Can list files by name, entire directories, wildcard with file extensions (e.g., *.pdf)
  • Exceptions to wildcard can be made with ! in front of a particular path/file name
  • Create file with touch and the file name/extension that you want
  • Remove files with rm and file name
  • SSH keys (security keys) live on your computer; share the public key with GitHub
  • To check .ssh/ folder, ls -al ~/.ssh
  • ssh-keygen -t ed25519 -C [youremailaddress@yourdomain.edu] or can name something like "Jessica@UA_laptop"
  • Optionally add a passphrase when prompted
  • cat ~/.ssh/id_ed25519.pub
  • Copy and paste output (with right click on a Windows) to a new SSH key on GitHub; name something that identifies which computer/machine the private key is located on, e.g. "Jessica@UA_laptop"
  • git remote add origin git@github.com:username/reponame.git to add the connection to the GitHub repo
  • git remote -v to check that your remotes are set up correctly
  • git push origin main/master to push your commits to the GitHub repo

9/15 Branching and Collaborating on GitHub

When you're ready to get your changes onto GitHub, you push your branch, open a pull request, and then, after it is merged, update your local copy. Here is a tutorial to help.

  • Make sure you're in the right directory: cd ./pilot-analyses
  • Session #4 notes (branches and forking with GitHub) from Jessica
    • No matter what, you need to delete your branch on GitHub AND on your local
  • git pull to pull changes down from GitHub
  • git branch to see what branches there are
  • git checkout -b branch-name to make a new branch and move there
  • git log to check commits, hit q to close out of the log
  • git checkout branch-name to move between branches
  • git push origin branch-name to push a branch to GitHub
  • git checkout main/master, then git merge branch-name to merge the branch into main/master
  • git branch -d branch-name to delete (after merging)
  • git push origin main/master to push the merged main branch to GitHub
  • git branch -D branch-name to delete (even if commits are not merged)
  • After branch is pushed, can open pull request (PR)
  • PR can be merged on GitHub
  • git checkout main/master, then git pull origin main/master to pull changes from GitHub to our local repo
  • Fork button to make your own copy of the repository on your GitHub account
  • pwd and cd to move to directory where you want to make a local copy of the GitHub repo
  • git clone git@github.com:username/reponame to clone (make a local copy) on your computer
  • cd plant-research-compendium-ex to get into the new repo

9/20 Project Management

  • upstream refers to another person's copy of the project on GitHub, origin is your own copy on GitHub, and local is the copy on your own computer
  • if you're producing a lot of figures or datasets, you might not want Git to track them, so you can list them in .gitignore so they aren't tracked along with your commits; then, when you're ready to share final files, you can remove them from .gitignore
  • modularizing your code into smaller scripts (clean, plot, model, etc.) can help readability rather than keeping it all in one script
  • You usually only fork a repo for yourself once, and then clone it to get a local copy once per computer
  • Can update local repo due to changes in upstream repo using git fetch and git merge (we’ll do this in a later session)
  • Research compendium: a way of organizing your research projects to make them more reproducible and understandable
  • Example folders that can go in a compendium include data_raw, data_clean, scripts, figs, docs, src (see the sketch after this list)
  • Lots of research compendium resources here: https://research-compendium.science/
  • Modularize by splitting your long multi-step coding scripts into a set of scripts that are interrelated
  • Each project folder can be a git repo and/or an RStudio project
  • Folders turned into an RStudio project will have a .Rproj file in them
  • Blog post by Jenny Bryan on RStudio Projects and why not to use setwd in your R scripts: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
  • Turn off saving the workspace: Tools --> Global Options --> Basic --> Workspace section, then uncheck the “Restore .RData into workspace at startup” and turn “Save workspace to .RData on exit” to “Never”
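One way to set up the example compendium folders listed above is from the R console. This is just a sketch, not a required workflow: the folder names come from the list above, and dir.create() with showWarnings = FALSE keeps things quiet if a folder already exists.

```r
# A sketch: create the example compendium folders from the project's root folder.
# Folder names are the examples from this session; adjust to your own project.
folders <- c("data_raw", "data_clean", "scripts", "figs", "docs", "src")

for (f in folders) {
  dir.create(f, showWarnings = FALSE)  # quiet if the folder already exists
}
```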

9/22 Pushing to upstream repos in GitHub, Functions and If Statements

  • git fetch upstream fails the first time - the upstream remote doesn't exist yet!
  • git remote add upstream git@github.com:cct-datascience/plant-research-compendium-ex.git to connect the local repo with the upstream / cct-datascience copy
  • git fetch upstream brings down the changes from the upstream repo (without merging them)
  • git log upstream/master ^master shows commits on upstream/master that are not in your local master
  • git merge upstream/master merges those commits into your local master (or whichever branch you are on)
  • git push origin master puts your local updates on origin
  • There is also now a 'Sync fork' option on GitHub to sync upstream and origin, but you would still need to pull the changes down to local
  • git checkout -b new-branch-name to create a new branch
  • If you find yourself copying + pasting code, could writing a custom function help?
  • Function outline:
  • function_name <- function(input) { output_value <- do_something(input); return(output_value) }
  • To comment or uncomment multiple lines: control or command + shift + c
  • Between curly brackets {} is plain R code
  • Between parentheses () are the arguments of a function, separated by commas
  • If statement outline: if (the conditional statement is TRUE) { do something }
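To tie the function and if-statement outlines together, here is a small runnable sketch. The function names, the temperature conversion, and the freezing threshold are made up for illustration and are not from the workshop itself.

```r
# A runnable sketch of the function and if-statement outlines above;
# the names and the conversion example are made up for illustration.
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9   # code between {} is plain R
  return(temp_c)
}

freezing_check <- function(temp_f) {
  temp_c <- fahrenheit_to_celsius(temp_f)
  if (temp_c <= 0) {                # runs only when the condition is TRUE
    message("At or below freezing")
  }
  return(temp_c)
}

freezing_check(20)   # prints the message, returns about -6.67
freezing_check(80)   # no message, returns about 26.67
```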

9/28 Functions and For Loops

  • git branch to check name of branch
  • git push origin branchname to push your branch to your origin
  • click link in the message to start a pull request from your branch
  • alternatively, navigate to upstream repository on GitHub, click 'New pull request', click 'compare across forks'
  • For loops and apply: you may need this if you're copying and pasting, then changing a few arguments. Some learning curve, but much less prone to error.
  • Example structure of for loop: for (item in list_of_items) { do_something(item) }
  • Can have multiple steps within the curly brackets
  • Objects created inside a for loop persist after the loop, but are usually overwritten each iteration, so they reflect the most recent value
  • for loops can be slow, but are often easier to debug where the error occurs
  • loops can be designed for each value directly or their index
  • Example of index-based for loop: for (i in 1:length(vector)) { do_something(vector[i]) }
  • Index-based loops are suited to pre-defining the output's dimensions, which uses less memory than "growing" the vector or dataframe with each iteration of the loop; growing the output aligns more with the value-based method, but for smaller tasks it doesn't make a practical difference (see the sketch after this list)
  • vectorized functions - can run on scalars and vectors. Many R operations/functions are already vectorized!
  • if/else conditionals are not vectorized, so a function with an if/else conditional won't run on a vector automatically
  • Such a function can be placed inside a for loop or used with apply() to make it work on a vector
  • install.packages('stringr') , library(stringr)
  • apply() family of functions includes sapply(), lapply(), tapply(), and mapply(), and can vectorize a custom function without having to use a for loop
  • sapply() takes the object you want to run the function on and the function you wish to use. Output is simplified to a vector or matrix (rather than list)
  • lapply() does the same, but the output is a list
  • tapply() has an additional argument of grouping vector and applies functions by group
  • mapply() runs functions that have multiple varying arguments (the other apply functions accept extra arguments, but only the first argument of the function can vary); the function itself is the first argument to mapply()
  • map() and related functions from the 'purrr' package provide similar functionality and are part of the tidyverse!
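Here is a runnable sketch pulling together the loop and apply notes above: an index-based for loop with a pre-allocated output vector, and sapply() used to run a non-vectorized if/else function over a vector. The weights vector and the size_class() function are made up for illustration.

```r
# Made-up example data for illustration
weights <- c(12, 45, 180, 7, 95)

# Index-based for loop with a pre-allocated output vector
doubled <- numeric(length(weights))   # pre-define the output's dimensions
for (i in 1:length(weights)) {
  doubled[i] <- weights[i] * 2
}
doubled

# An if/else conditional is not vectorized, so this function
# only works on a single value at a time
size_class <- function(w) {
  if (w >= 100) {
    "large"
  } else {
    "small"
  }
}

# sapply() applies the function to each element and simplifies
# the result to a vector (rather than a list)
sapply(weights, size_class)
#> "small" "small" "large" "small" "small"
```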

10/3 Session Notes

  • Session #9 notes on ggplot2:
  • This plotting tool can only use dataframes as input
  • Overall format of code to create a plot using ggplot
  • ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
  • Chaining together pieces uses a + , as opposed to the pipe for dplyr
  • First argument to ggplot is dataframe to plot, then mapping specifies what goes on what axes; then requires at least one geom_* function
  • Determine what type of plot it is (scatterplot, histogram, etc.) with geom_* function; reference list of different possible plots: https://ggplot2.tidyverse.org/reference/
  • Can save out a plot using assignment operator, e.g.,
  • surveys_plot <- ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) + geom_point()
  • We often start out creating plot interactively before saving it as an object
  • Can use pipes with ggplot, such as piping a dataframe into a ggplot call. Can do intermediate data cleaning steps before the plotting part. Can only do this with one dataframe though.
  • surveys %>% ggplot(aes(x = weight, y = hindfoot_length)) + geom_point()
  • To change appearance of all points, can add arguments to geom_* such as alpha (for opacity) or color.
  • Can change appearance of points by groups in a column, can add color = col_name argument to aes function
  • Can choose colors using scale_color_* type functions
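To make the pattern above concrete, here is a self-contained sketch that swaps in R's built-in iris data frame for the workshop's surveys data; the column names and the viridis color scale are just illustrative choices.

```r
# A self-contained sketch of the ggplot2 pattern above, using the built-in
# iris data frame as a stand-in for the workshop's surveys data
library(ggplot2)

# Save the plot as an object with the assignment operator
iris_plot <- ggplot(data = iris,
                    mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(alpha = 0.5)          # alpha changes the opacity of all points
iris_plot

# Color points by the values in a column by mapping it inside aes()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_viridis_d()          # a scale_color_* function picks the palette
```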

Session Notes on dplyr

  • If you want to filter on multiple options for a particular column, use %in% with a vector of options instead of ==
  • Instead of nesting functions to do multiple data cleaning steps, e.g., filter(select(df, -day), month >= 7), or saving out an intermediate data object between each step, you can use the pipe %>% (a fuller, runnable sketch follows this list)
  • Pipe avoids having to keep track of lots of different intermediate objects and having multiple nested functions that are hard to read
  • Order goes dataframe, pipe, cleaning step 1, pipe, cleaning step 2, pipe, rinse and repeat; see example below
  • surveys %>% select(-day) %>% filter(month >= 7)
  • Pipe goes at the end of each line
  • Keyboard shortcut for %>% is ctrl+shift+M (windows) and cmd+shift+M (apple)
  • Save out final dataframe after cleaning steps with assignment operator, e.g.,
  • surveys_sub <- surveys %>% select(-day) %>% filter(month >= 7)
  • Pay attention to the order you’re doing steps in; if you change the dataframe in one step, it could affect the next step
  • Use mutate to make new columns
  • Format is mutate(new_col_name = operations on one or more old_col)
  • The R package udunits2 is useful for converting between units; it is based on the UDUNITS library: https://www.unidata.ucar.edu/software/udunits/
  • udunits2::ud.convert(col, "current_unit", "new_unit")
  • Use relocate to move around order of columns; can use .before and .after arguments to specify where columns go
  • Can pipe within a dplyr function, e.g., mutate(date = paste(year, month, day, sep = "-") %>% as.Date())
  • split-apply-combine approach is splitting up dataframe by groups and doing some operation on each group
  • Use the functions group_by and summarize; group_by takes one or more columns, then summarize creates a new column based on an operation applied within each group, e.g.,
  • surveys %>% group_by(species_id, sex) %>% summarize(mean_weight = mean(weight, na.rm = T))
  • Use arrange to sort rows of a dataframe by a column in ascending order, or arrange(desc()) for descending order
  • Can group by and then summarize but retain all of the original dataframe columns, with the new operation’s column, by doing group_by and then mutate, with desired operation in the latter function
  • Use ungroup when you want to do some calculations on the entire dataframe after doing a grouping
  • Converting dataframes between wide and long formats; recording data is usually easier to do wide, but some data tools like ggplot2 work better with long format data
  • pivot_wider(dataframe, names_from = col_names, values_from = col_names), where names_from gives the column whose values become the new column names, and values_from gives the column that populates those new columns, e.g.,
  • sp_by_plot_wide <- sp_by_plot %>% pivot_wider(names_from = plot_id, values_from = mean_weight)
  • pivot_longer(dataframe, cols = col_names, names_to = "new col name", values_to = "new value name"), where cols specifies which columns to collapse into a single new column, names_to is the name of the new column that holds the old column names, and values_to is the name of the new values column, e.g. (note that the example below excludes the species_id column instead of specifying which columns to include):
  • sp_by_plot_wide %>% pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
  • Animations of pivoting under the “Tidy Data” section of this GitHub repo: https://github.com/gadenbuie/tidyexplain
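To tie these steps together, here is a self-contained sketch that uses a small made-up data frame in place of the workshop's surveys data; the object names mirror the examples above, but the values are invented for illustration.

```r
# A self-contained sketch of the dplyr/tidyr steps above, using a small
# made-up data frame as a stand-in for the workshop's surveys data
library(dplyr)
library(tidyr)

surveys_demo <- data.frame(
  species_id = c("DM", "DM", "PE", "PE", "DM", "PE"),
  plot_id    = c(1, 2, 1, 2, 1, 2),
  month      = c(6, 7, 8, 7, 9, 6),
  day        = c(12, 3, 21, 8, 30, 15),
  weight     = c(40, 44, 21, 19, 42, 23)
)

# Pipe several cleaning steps together instead of nesting them
surveys_sub <- surveys_demo %>%
  select(-day) %>%                      # drop a column
  filter(month >= 7) %>%                # keep later months only
  mutate(weight_kg = weight / 1000)     # add a new column with mutate

# split-apply-combine: mean weight by species and plot
sp_by_plot <- surveys_sub %>%
  group_by(species_id, plot_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), .groups = "drop")

# Reshape to wide (one column per plot) and back to long
sp_by_plot_wide <- sp_by_plot %>%
  pivot_wider(names_from = plot_id, values_from = mean_weight)

sp_by_plot_wide %>%
  pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
```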

10/6 Session Notes

  • Markdown is a text formatting tool, basis of lots of docs, renders plaintext to HTML
  • Easy to edit and read, can be tracked by version control
  • Rmarkdown is a tool to integrate rendered text and executed R code ('literate programming')
  • Use for more comprehensive documentation or reporting
  • Even used for manuscripts and posters!
  • Top of RMarkdown is a yaml header
  • Good resource: https://bookdown.org/yihui/rmarkdown/
  • Text in RMarkdown is written in Markdown, renders upon clicking 'Knit'
  • Can add an R code chunk by clicking the green Insert button, by typing ```{r} and ``` yourself, or with the keyboard shortcut Ctrl + Alt + I (Windows) / Option + Command + I (Mac)
  • R code goes between the two sets of ``` (a minimal .Rmd sketch appears at the end of these notes)
  • In the console, R code chunks use your existing working directory. When knitting, the Rmd assumes the working directory is the folder where the Rmd is saved.
  • tl;dr: connection errors can result from different working directories
  • R code chunks have many options that can be added to the {r}
  • {r, echo = FALSE}
  • chunks can be named for easier navigation in a long Rmd
  • https://yihui.org/knitr/options/
  • R code can be integrated into text as inline code between backticks that start with r, e.g., `r length(unique(data$species_id))`
  • Equations can be rendered in the text sections between $'s, e.g, $\alpha = 0.05$
  • Visual mode renders the text sections, a good way to get feedback and improve familiarity with Markdown
  • A Render tab opens when knitting, akin to a new console
  • One benefit of knitting to html is publishing, either to RStudio Connect (UA has a license, need to register) or to Rpubs
  • READMEs describe the project as a whole, often as a .md (Markdown) file
  • Typically saved in root project folder as README.md
  • Standard sections can differ between research compendia vs. software
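A minimal sketch of what an .Rmd file might look like, pulling together the YAML header, a named chunk with an option, inline code, and an equation from the notes above; the title, the chunk name, and the use of the built-in iris data are illustrative, not from the workshop.

````markdown
---
title: "Minimal example"
output: html_document
---

Text is written in Markdown. Inline R goes between backticks starting with r,
e.g., there are `r length(unique(iris$Species))` species in the iris data.
Equations go between dollar signs, e.g., $\alpha = 0.05$.

```{r iris-summary, echo = FALSE}
# a named chunk with an option; echo = FALSE hides the code in the knitted output
summary(iris)
```
````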