
Reproducibility in R


Reproducibility in R Workshop

9/6 Git Bash and Terminal Work

In this workshop we worked on setting up Git Bash in RStudio and learning to navigate the command line in the Terminal pane within RStudio. One outcome from this workshop was learning how to navigate different files and folders and determining which folders on my own computer I might use most often. I expect that my most frequent folders will be:

  1. c/users/erika/Documents/R for learning R commands and for documenting scripts
  2. c/users/erika/Documents/Work for data and files related to work tasks and projects
  3. c/users/erika/Documents/Dissertation Work for data and files related to my thesis work

Shell commands are really helpful for navigating folders from the terminal. Here are some important commands I used:

  • pwd: command to print your current working directory
  • cd: change directory/folder
  • ls: list the files and folders in your current directory
  • Commands have optional arguments that you can add to the command with a dash
  • Find manual page for commands by doing [command] --help or man [command] (exit man by hitting q)
  • Download the shell-lesson-data.zip file and move it to your Desktop folder
  • ls [folder] will list contents of a subfolder from your current location
  • Unzip it with cd Desktop then unzip shell-lesson-data.zip
  • File and folder names should have no spaces in them; use dashes or underscores instead
  • Can use cd multiple times to move further into subfolders or put together a path like cd Desktop/project
  • Absolute paths start with a / and work from anywhere on your computer; relative paths depend on where you are currently located
  • An absolute path is like the address for a building, while relative paths are like directions from where you’re at to another location
  • Move up one folder level using cd ..
  • . is shorthand for your current folder and .. is shorthand for the folder above your current folder

9/8 Learning Git for Version Control

For secure data, you can store the files on a local machine that all of the users can access (or in BoxHealth), use Git and GitHub to store only the code, and then point the code to the data with a file path. In general, GitHub is not a data repository because repositories have size limits. In Git there are three parts to saving work: saving your files, staging the changes in a staging area, and then committing the staged changes to the repository.

Here are some helpful commands I learned and notes for this session:

  • When you start git for the first time, you need to configure your git username and email (only need to do this once per computer): git config --global user.name "[name]", git config --global user.email "[GitHub email]"
  • Check current git config settings with git config --list
  • One main folder per project, which we’ll turn into a git repository now and later an RStudio project
  • Create new folder with mkdir [directory name]
  • From within a project folder, turn it into a git repo using git init
  • Hidden files have names that begin with a period, can see hidden files using ls -a
  • Can see info about your git repo using git status
  • History of moving from master to main terminology: https://www.jumpingrivers.com/blog/git-moving-master-to-main/
  • Move files to the stage with git add [file name]
  • Commit the staged files with git commit -m "[message]"
  • Each commit represents a chunk of work, or a new version of your files
  • Create a new R script, save it (Jessica recommends using a numbered system for file naming!), and do git status to see what has changed in repo
  • git add 01-clean-soil-data.R and git commit -m "Initialize cleaning script"
  • Commit messages usually start with a present tense verb
  • You can add multiple files with git add to include multiple files in a particular commit
  • Can get list of commits using git log
  • To compare against an earlier version, use git log to find the commit's hash (the long ID), then git diff [hash] to see what has changed since that commit

9/13 Sharing with GitHub

We learned the difference between Git (which we learned last time) and GitHub. GitHub is important for sharing projects publicly and collaborating with others, and it displays commits and history much more clearly than the log in Git. We followed this tutorial from Software Carpentry.

Here are some helpful commands I learned and notes for this session:

  • Moving files with mv, first argument is path to file to be moved, second argument is path to where it should be moved to
  • Create .gitignore as a text file, type paths/names of files you do NOT want git to track
  • Must be named .gitignore
  • Should be located in the main repository folder
  • Can list files by name, entire directories, wildcard with file extensions (e.g., *.pdf)
  • Exceptions to wildcard can be made with ! in front of a particular path/file name
  • Create file with touch and the file name/extension that you want
  • Remove files with rm and file name
  • SSH keys (security keys) live on your computer; share the public key with GitHub
  • To check .ssh/ folder, ls -al ~/.ssh
  • ssh-keygen -t ed25519 -C [youremailaddress@yourdomain.edu] or can name something like "Jessica@UA_laptop"
  • Optionally add a passphrase when prompted
  • cat ~/.ssh/id_ed25519.pub
  • Copy and paste output (with right click on a Windows) to a new SSH key on GitHub; name something that identifies which computer/machine the private key is located on, e.g. "Jessica@UA_laptop"
  • git remote add origin git@github.com:username/reponame.git to add the connection to the GitHub repo
  • git remote -v to check that your remotes are set up correctly
  • git push origin main/master to push your commits to the GitHub repo

9/15 Branching and Collaborating on GitHub

When you're ready to get your changes onto GitHub, you push your branch, open a pull request, and then, after it is merged, update your local copy. Here is a tutorial to help.

  • Make sure you're in the right directory: cd ./pilot-analyses
  • Session #4 notes (branches and forking with GitHub) from Jessica
    • No matter what, you need to delete your branch on GitHub AND on your local
  • git pull to pull changes down from GitHub
  • git branch to see what branches there are
  • git checkout -b branch-name to make a new branch and move there
  • git log to check commits, hit q to close out of the log
  • git checkout branch-name to move between branches
  • git push origin branch-name to push a branch to GitHub
  • git checkout main/master, then git merge branch-name to merge the branch into main/master
  • git branch -d branch-name to delete (after merging)
  • git push origin main/master to push the merged main branch to GitHub
  • git branch -D branch-name to delete (even if commits are not merged)
  • After branch is pushed, can open pull request (PR)
  • PR can be merged on GitHub
  • git checkout main/master, then git pull origin main/master to pull changes from GitHub to our local repo
  • Fork button to make your own copy of the repository on your GitHub account
  • pwd and cd to move to directory where you want to make a local copy of the GitHub repo
  • git clone git@github.com:username/reponame to clone (make a local copy) on your computer
  • cd plant-research-compendium-ex to get into the new repo

9/20 Project Management

  • upstream refers to another person's copy of the project on GitHub, origin is your own copy on GitHub, and local is the copy on your own computer
  • if you're producing a lot of figures or datasets, you might not want Git to track them, so you can list them in .gitignore so they aren't tracked along with your commits; then, when you're ready to share final files, you can remove them from .gitignore
  • modularizing your code into smaller scripts (clean, plot, model, etc.) can help readability rather than keeping it all in one script
  • You usually only fork a repo for yourself once, and then clone it to get a local copy once per computer
  • Can update local repo due to changes in upstream repo using git fetch and git merge (we’ll do this in a later session)
  • Research compendium: a way of organizing your research projects to make them more reproducible and understandable
  • Example folders that can go in a compendium include data_raw, data_clean, scripts, figs, docs, src (see the sketch after this list)
  • Lots of research compendium resources here: https://research-compendium.science/
  • Modularize by splitting your long multi-step coding scripts into a set of scripts that are interrelated
  • Each project folder can be a git repo and/or an RStudio project
  • Folders turned into an RStudio project will have a .Rproj file in them
  • Blog post by Jenny Bryan on RStudio Projects and why not to use setwd in your R scripts: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
  • Turn off saving the workspace: Tools --> Global Options --> Basic --> Workspace section, then uncheck the “Restore .RData into workspace at startup” and turn “Save workspace to .RData on exit” to “Never”
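One way to set up the example compendium folders listed above is from the R console. This is just a sketch, not a required workflow: the folder names come from the list above, and dir.create() with showWarnings = FALSE keeps things quiet if a folder already exists.

```r
# A sketch: create the example compendium folders from the project's root folder.
# Folder names are the examples from this session; adjust to your own project.
folders <- c("data_raw", "data_clean", "scripts", "figs", "docs", "src")

for (f in folders) {
  dir.create(f, showWarnings = FALSE)  # quiet if the folder already exists
}
```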

9/22 Pushing to upstream repos in GitHub, Functions and If Statements

  • git fetch upstream fails the first time - the upstream remote doesn't exist yet!
  • git remote add upstream git@github.com:cct-datascience/plant-research-compendium-ex.git to connect the local repo with the upstream / cct-datascience copy
  • git fetch upstream brings down the changes from the upstream repo (without merging them)
  • git log upstream/master ^master shows commits on upstream/master that are not in your local master
  • git merge upstream/master merges those commits into your local master (or whichever branch you are on)
  • git push origin master puts your local updates on origin
  • There is also now a 'Sync fork' option on GitHub to sync upstream and origin, but you would still need to pull the changes down to local
  • git checkout -b new-branch-name to create a new branch
  • If you find yourself copying + pasting code, could writing a custom function help?
  • Function outline:
  • function_name <- function(input) { output_value <- do_something(input); return(output_value) }
  • To comment or uncomment multiple lines: control or command + shift + c
  • Between curly brackets {} is plain R code
  • Between parentheses () are the arguments of a function, separated by commas
  • If statement outline: if (the conditional statement is TRUE) { do something }
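To tie the function and if-statement outlines together, here is a small runnable sketch. The function names, the temperature conversion, and the freezing threshold are made up for illustration and are not from the workshop itself.

```r
# A runnable sketch of the function and if-statement outlines above;
# the names and the conversion example are made up for illustration.
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9   # code between {} is plain R
  return(temp_c)
}

freezing_check <- function(temp_f) {
  temp_c <- fahrenheit_to_celsius(temp_f)
  if (temp_c <= 0) {                # runs only when the condition is TRUE
    message("At or below freezing")
  }
  return(temp_c)
}

freezing_check(20)   # prints the message, returns about -6.67
freezing_check(80)   # no message, returns about 26.67
```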

9/28 Functions and For Loops

  • git branch to check name of branch
  • git push origin branchname to push your branch to your origin
  • click link in the message to start a pull request from your branch
  • alternatively, navigate to upstream repository on GitHub, click 'New pull request', click 'compare across forks'
  • For loops and apply: you may need this if you're copying and pasting, then changing a few arguments. Some learning curve, but much less prone to error.
  • Example structure of for loop: for (item in list_of_items) { do_something(item) }
  • Can have multiple steps within the curly brackets
  • Objects created inside a for loop persist after the loop, but are usually overwritten each iteration, so they reflect the most recent value
  • for loops can be slow, but are often easier to debug where the error occurs
  • loops can be designed for each value directly or their index
  • Example of index-based for loop: for (i in 1:length(vector)) { do_something(vector[i]) }
  • Index-based loops are suited to pre-defining the output's dimensions, which uses less memory than "growing" the vector or dataframe with each iteration of the loop; growing the output aligns more with the value-based method, but for smaller tasks it doesn't make a practical difference (see the sketch after this list)
  • vectorized functions - can run on scalars and vectors. Many R operations/functions are already vectorized!
  • if/else conditionals are not vectorized, so a function with an if/else conditional won't run on a vector automatically
  • Such a function can be placed inside a for loop or used with apply() to make it work on a vector
  • install.packages('stringr') , library(stringr)
  • apply() family of functions includes sapply(), lapply(), tapply(), and mapply(), and can vectorize a custom function without having to use a for loop
  • sapply() takes the object you want to run the function on and the function you wish to use. Output is simplified to a vector or matrix (rather than list)
  • lapply() does the same, but the output is a list
  • tapply() has an additional argument of grouping vector and applies functions by group
  • mapply() runs functions that have multiple varying arguments (the other apply functions accept extra arguments, but only the first argument of the function can vary); the function itself is the first argument to mapply()
  • map() and related functions from the 'purrr' package provide similar functionality and are part of the tidyverse!
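Here is a runnable sketch pulling together the loop and apply notes above: an index-based for loop with a pre-allocated output vector, and sapply() used to run a non-vectorized if/else function over a vector. The weights vector and the size_class() function are made up for illustration.

```r
# Made-up example data for illustration
weights <- c(12, 45, 180, 7, 95)

# Index-based for loop with a pre-allocated output vector
doubled <- numeric(length(weights))   # pre-define the output's dimensions
for (i in 1:length(weights)) {
  doubled[i] <- weights[i] * 2
}
doubled

# An if/else conditional is not vectorized, so this function
# only works on a single value at a time
size_class <- function(w) {
  if (w >= 100) {
    "large"
  } else {
    "small"
  }
}

# sapply() applies the function to each element and simplifies
# the result to a vector (rather than a list)
sapply(weights, size_class)
#> "small" "small" "large" "small" "small"
```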

10/3 Session Notes

  • Session #9 notes on ggplot2:
  • This plotting tool can only use dataframes as input
  • Overall format of code to create a plot using ggplot
  • ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
  • Chaining together pieces uses a + , as opposed to the pipe for dplyr
  • First argument to ggplot is dataframe to plot, then mapping specifies what goes on what axes; then requires at least one geom_* function
  • Determine what type of plot it is (scatterplot, histogram, etc.) with geom_* function; reference list of different possible plots: https://ggplot2.tidyverse.org/reference/
  • Can save out a plot using assignment operator, e.g.,
  • surveys_plot <- ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) + geom_point()
  • We often start out creating plot interactively before saving it as an object
  • Can use pipes with ggplot, such as piping a dataframe into a ggplot call. Can do intermediate data cleaning steps before the plotting part. Can only do this with one dataframe though.
  • surveys %>% ggplot(aes(x = weight, y = hindfoot_length)) + geom_point()
  • To change appearance of all points, can add arguments to geom_* such as alpha (for opacity) or color.
  • Can change appearance of points by groups in a column, can add color = col_name argument to aes function
  • Can choose colors using scale_color_* type functions
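To make the pattern above concrete, here is a self-contained sketch that swaps in R's built-in iris data frame for the workshop's surveys data; the column names and the viridis color scale are just illustrative choices.

```r
# A self-contained sketch of the ggplot2 pattern above, using the built-in
# iris data frame as a stand-in for the workshop's surveys data
library(ggplot2)

# Save the plot as an object with the assignment operator
iris_plot <- ggplot(data = iris,
                    mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(alpha = 0.5)          # alpha changes the opacity of all points
iris_plot

# Color points by the values in a column by mapping it inside aes()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_viridis_d()          # a scale_color_* function picks the palette
```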

Session Notes on dplyr

  • If you want to filter on multiple options for a particular column, use %in% with a vector of options instead of ==
  • Instead of nesting functions to do multiple data cleaning steps, e.g., filter(select(df, -day), month >= 7), or saving out an intermediate data object between each step, you can use the pipe %>% (a fuller, runnable sketch follows this list)
  • Pipe avoids having to keep track of lots of different intermediate objects and having multiple nested functions that are hard to read
  • Order goes dataframe, pipe, cleaning step 1, pipe, cleaning step 2, pipe, rinse and repeat; see example below
  • surveys %>% select(-day) %>% filter(month >= 7)
  • Pipe goes at the end of each line
  • Keyboard shortcut for %>% is ctrl+shift+M (windows) and cmd+shift+M (apple)
  • Save out final dataframe after cleaning steps with assignment operator, e.g.,
  • surveys_sub <- surveys %>% select(-day) %>% filter(month >= 7)
  • Pay attention to the order you’re doing steps in; if you change the dataframe in one step, it could affect the next step
  • Use mutate to make new columns
  • Format is mutate(new_col_name = operations on one or more old_col)
  • The R package udunits2 is useful for converting between units; it is based on the UDUNITS library: https://www.unidata.ucar.edu/software/udunits/
  • udunits2::ud.convert(col, "current_unit", "new_unit")
  • Use relocate to move around order of columns; can use .before and .after arguments to specify where columns go
  • Can pipe within a dplyr function, e.g., mutate(date = paste(year, month, day, sep = "-") %>% as.Date())
  • split-apply-combine approach is splitting up dataframe by groups and doing some operation on each group
  • Use the functions group_by and summarize; group_by takes one or more columns, then summarize creates a new column based on an operation applied within each group, e.g.,
  • surveys %>% group_by(species_id, sex) %>% summarize(mean_weight = mean(weight, na.rm = T))
  • Use arrange to sort rows of a dataframe by a column in ascending order, or arrange(desc()) for descending order
  • Can group by and then summarize but retain all of the original dataframe columns, with the new operation’s column, by doing group_by and then mutate, with desired operation in the latter function
  • Use ungroup when you want to do some calculations on the entire dataframe after doing a grouping
  • Converting dataframes between wide and long formats; recording data is usually easier to do wide, but some data tools like ggplot2 work better with long format data
  • pivot_wider(dataframe, names_from = col_names, values_from = col_names), where names_from gives the column whose values become the new column names, and values_from gives the column that populates those new columns, e.g.,
  • sp_by_plot_wide <- sp_by_plot %>% pivot_wider(names_from = plot_id, values_from = mean_weight)
  • pivot_longer(dataframe, cols = col_names, names_to = "new col name", values_to = "new value name"), where cols specifies which columns to collapse into a single new column, names_to is the name of the new column that holds the old column names, and values_to is the name of the new values column, e.g. (note that the example below excludes the species_id column instead of specifying which columns to include):
  • sp_by_plot_wide %>% pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
  • Animations of pivoting under the “Tidy Data” section of this GitHub repo: https://github.com/gadenbuie/tidyexplain
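To tie these steps together, here is a self-contained sketch that uses a small made-up data frame in place of the workshop's surveys data; the object names mirror the examples above, but the values are invented for illustration.

```r
# A self-contained sketch of the dplyr/tidyr steps above, using a small
# made-up data frame as a stand-in for the workshop's surveys data
library(dplyr)
library(tidyr)

surveys_demo <- data.frame(
  species_id = c("DM", "DM", "PE", "PE", "DM", "PE"),
  plot_id    = c(1, 2, 1, 2, 1, 2),
  month      = c(6, 7, 8, 7, 9, 6),
  day        = c(12, 3, 21, 8, 30, 15),
  weight     = c(40, 44, 21, 19, 42, 23)
)

# Pipe several cleaning steps together instead of nesting them
surveys_sub <- surveys_demo %>%
  select(-day) %>%                      # drop a column
  filter(month >= 7) %>%                # keep later months only
  mutate(weight_kg = weight / 1000)     # add a new column with mutate

# split-apply-combine: mean weight by species and plot
sp_by_plot <- surveys_sub %>%
  group_by(species_id, plot_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), .groups = "drop")

# Reshape to wide (one column per plot) and back to long
sp_by_plot_wide <- sp_by_plot %>%
  pivot_wider(names_from = plot_id, values_from = mean_weight)

sp_by_plot_wide %>%
  pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
```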

10/6 Session Notes

  • Markdown is a text formatting tool, basis of lots of docs, renders plaintext to HTML
  • Easy to edit and read, can be tracked by version control
  • Rmarkdown is a tool to integrate rendered text and executed R code ('literate programming')
  • Use for more comprehensive documentation or reporting
  • Even used for manuscripts and posters!
  • Top of RMarkdown is a yaml header
  • Good resource: https://bookdown.org/yihui/rmarkdown/
  • Text in RMarkdown is written in Markdown, renders upon clicking 'Knit'
  • Can add an R code chunk by clicking the green Insert button, by typing ```{r} and ``` yourself, or with the keyboard shortcut Ctrl + Alt + I (Windows) / Option + Command + I (Mac)
  • R code goes between the two sets of ``` (a minimal .Rmd sketch appears at the end of these notes)
  • In the console, R code chunks use your existing working directory. When knitting, the Rmd assumes the working directory is the folder where the Rmd is saved.
  • tl;dr: connection errors can result from different working directories
  • R code chunks have many options that can be added to the {r}
  • {r, echo = FALSE}
  • chunks can be named for easier navigation in a long Rmd
  • https://yihui.org/knitr/options/
  • R code can be integrated into text as inline code between backticks that start with r, e.g., `r length(unique(data$species_id))`
  • Equations can be rendered in the text sections between $'s, e.g, $\alpha = 0.05$
  • Visual mode renders the text sections, a good way to get feedback and improve familiarity with Markdown
  • A Render tab opens when knitting, akin to a new console
  • One benefit of knitting to html is publishing, either to RStudio Connect (UA has a license, need to register) or to Rpubs
  • READMEs describe the project as a whole, often as a .md (Markdown) file
  • Typically saved in root project folder as README.md
  • Standard sections can differ between research compendia vs. software
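A minimal sketch of what an .Rmd file might look like, pulling together the YAML header, a named chunk with an option, inline code, and an equation from the notes above; the title, the chunk name, and the use of the built-in iris data are illustrative, not from the workshop.

````markdown
---
title: "Minimal example"
output: html_document
---

Text is written in Markdown. Inline R goes between backticks starting with r,
e.g., there are `r length(unique(iris$Species))` species in the iris data.
Equations go between dollar signs, e.g., $\alpha = 0.05$.

```{r iris-summary, echo = FALSE}
# a named chunk with an option; echo = FALSE hides the code in the knitted output
summary(iris)
```
````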