Reproducibility in R
In this workshop we worked on setting up Git Bash in RStudio and learned to navigate the command line in the terminal within RStudio. One outcome from this workshop was learning how to navigate between files and determining which folders on our own local computers we might use most often. I expect that my most frequent folders will be:
- `c/users/erika/Documents/R` for learning R commands and for documenting scripts
- `c/users/erika/Documents/Work` for data and files related to work tasks and projects
- `c/users/erika/Documents/Dissertation Work` for data and files related to my thesis work

Shell commands are really helpful for navigating folders from the terminal. Here are some important commands I used:
- `pwd`: print your current working directory
- `cd`: change directory/folder
- `ls`: list the files and folders in your current directory
- Commands have optional arguments that you can add to the command with a dash
- Find the manual page for a command with `[command] --help` or `man [command]` (exit `man` by hitting q)
- Download the shell-data folder and move it to your Desktop folder
- `ls [folder]` will list the contents of a subfolder from your current location
- Unzip the shell-data folder using `cd Desktop`, then `unzip shell-lesson-data.zip`
- File and folder names should have no spaces in them; use dashes or underscores instead
- Can use `cd` multiple times to move further into subfolders, or put together a path like `cd Desktop/project`
- Absolute paths start with a `/` and work from anywhere on your computer; relative paths depend on where you are currently located. An absolute path is like the address for a building, while a relative path is like directions from where you are to another location
- Move up one folder level using `cd ..`
- `.` is shorthand for your current folder and `..` is shorthand for the folder above your current folder
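A quick session putting these navigation commands together (the folder names here are made up for illustration):

```shell
# print the current working directory; the output is an
# absolute path, so it starts with a /
pwd

# make a small practice folder structure (hypothetical names)
mkdir -p practice/project/data

# move down two levels with a relative path, then list the contents
cd practice/project
ls        # the only thing in here is the data folder

# .. means "the folder above", so this moves back up to practice/
cd ..

# and one more .. returns to where we started
cd ..
```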
For secure data, you can store the data on a local machine that all users can access (or BoxHealth), while using Git and GitHub to store the code, changing the file paths as needed. In general, GitHub is not a data repository because there are size limits on repositories and on individual files. In Git, there are three parts to saving work: saving the file, adding the change to the staging area, and committing the change to the repository.
Here are some helpful commands I learned and notes for this session:
- When you start Git for the first time, you need to configure your Git username and email (only need to do this once per computer): `git config --global user.name "[name]"`, `git config --global user.email "[GitHub email]"`
- Check current Git config settings with `git config --list`
- One main folder per project, which we'll turn into a Git repository now and later an RStudio project
- Create a new folder with `mkdir [directory name]`
- From within a project folder, turn it into a Git repo using `git init`
- Hidden files have names that begin with a period; see hidden files using `ls -a`
- Can see info about your Git repo using `git status`
- History of moving from "master" to "main" terminology: https://www.jumpingrivers.com/blog/git-moving-master-to-main/
- Move files to the stage with `git add [file name]`
- Commit the staged files with `git commit -m "[message]"`
- Each commit represents a chunk of work, or a new version of your files
- Create a new R script, save it (Jessica recommends using a numbered system for file naming!), and run `git status` to see what has changed in the repo
- `git add 01-clean-soil-data.R` and `git commit -m "Initialize cleaning script"`
- Commit messages usually start with a present-tense verb
- You can run `git add` on multiple files to include them all in a particular commit
- Get a list of commits using `git log`
- To look back at a previous commit, use `git log` and copy the hash for the commit, then use `git diff [hash]` to compare your current version against that previous one
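Putting this session's commands together, a first commit in a brand-new repo looks roughly like this (the folder name is made up, and the file name echoes the lesson's example):

```shell
# make a project folder and turn it into a Git repository
mkdir soil-project
cd soil-project
git init

# identify yourself; the workshop uses --global (once per computer),
# but here the settings are local so the example is self-contained
git config user.name "Your Name"
git config user.email "you@example.com"

# create a script, check the repo, stage the file, and commit it
touch 01-clean-soil-data.R
git status                                  # shows one untracked file
git add 01-clean-soil-data.R
git commit -m "Initialize cleaning script"  # present-tense verb first
git log --oneline                           # lists the single commit
```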
We learned the difference between Git (which we covered last time) and GitHub. GitHub is important for sharing projects publicly and collaborating with others, and it displays commits and logs much more readably than the log in Git. We followed this tutorial from Software Carpentry.
Here are some helpful commands I learned and notes for this session:
- Move files with `mv`; the first argument is the path to the file to be moved, the second is the path to where it should be moved
- Create `.gitignore` as a text file and type in the paths/names of files you do NOT want Git to track
- Must be named `.gitignore` and should be located in the main repository folder
- Can list files by name, entire directories, or wildcards with file extensions (e.g., `*.pdf`)
- Exceptions to a wildcard can be made with `!` in front of a particular path/file name
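For example, a `.gitignore` combining these patterns might look like this (the file and folder names are hypothetical):

```
# ignore one specific file
scratch-notes.txt

# ignore an entire folder
figures/

# ignore every PDF, except one report we still want tracked
*.pdf
!final-report.pdf
```

Note that the `!` exception works here because `*.pdf` is a bare pattern; Git cannot re-include a file whose parent directory is excluded.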
- Create a file with `touch` and the file name/extension that you want
- Remove files with `rm` and the file name
- Security keys (SSH keys) live on your computer; share the public key with GitHub
- To check the `.ssh/` folder: `ls -al ~/.ssh`
- `ssh-keygen -t ed25519 -C "[youremailaddress@yourdomain.edu]"`; the comment can also be something like "Jessica@UA_laptop". Add a password or don't
- `cat ~/.ssh/id_ed25519.pub`, then copy and paste the output (with right click on Windows) into a new SSH key on GitHub; name it something that identifies which computer/machine the private key is located on, e.g., "Jessica@UA_laptop"
- `git remote add origin git@github.com:username/reponame.git` to add the connection to the GitHub repo
- `git remote -v` to check that your remotes are set up correctly
- `git push origin main` (or `master`) to push to the GitHub repo
When you're ready to get your changes into the main branch on GitHub, you push your branch, open a pull request, and then update your local copy once it is merged. Here is a tutorial to help.
- Make sure you're in the right directory: `cd ./pilot-analyses`
- Session #4 notes (branches and forking with GitHub) from Jessica
- No matter what, you need to delete your branch on GitHub AND on your local machine
- `git pull` to pull changes down from GitHub
- `git branch` to see what branches there are
- `git checkout -b branch-name` to make a new branch and move there
- `git log` to check commits; hit q to close out of the log
- `git checkout branch-name` to move between branches
- `git push origin branch-name` to push a branch to GitHub
- `git checkout main` (or `master`), then `git merge branch-name` to merge a branch into main/master
- `git branch -d branch-name` to delete a branch (after merging)
- `git push origin main` (or `master`) to push the merged main branch to GitHub
- `git branch -D branch-name` to delete a branch (even if its commits are not merged)
- After a branch is pushed, you can open a pull request (PR)
- The PR can be merged on GitHub
- `git checkout main` (or `master`), then `git pull origin main` (or `master`) to pull changes from GitHub to our local repo
- Fork button to make your own copy of the repository on your GitHub account
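A local round trip through the branch commands above can be sketched as follows; there is no GitHub remote here, so the `git push` and pull-request steps are skipped, and the repo and branch names are made up:

```shell
# set up a small repo with one commit (local identity so the
# example is self-contained)
mkdir branch-demo
cd branch-demo
git init
git config user.name "Your Name"
git config user.email "you@example.com"
echo "x <- 1" > analysis.R
git add analysis.R
git commit -m "Add analysis script"
git branch -M main            # make sure the default branch is 'main'

# make a new branch, move there, and commit a change on it
git checkout -b fix-cleaning
echo "y <- 2" >> analysis.R
git add analysis.R
git commit -m "Fix cleaning step"

# merge the branch into main, then delete it
git checkout main
git merge fix-cleaning
git branch -d fix-cleaning
git log --oneline             # both commits are now on main
```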
- `pwd` and `cd` to move to the directory where you want to make a local copy of the GitHub repo
- `git clone git@github.com:username/reponame.git` to clone (make a local copy of) the repo on your computer
- `cd plant-research-compendium-ex` to get into the new repo
- Upstream means another person's GitHub project, origin is your own personal GitHub copy, and local is your own computer
- If you're producing a lot of figures or datasets, you might not want Git to track them, so you could include them in `.gitignore` so they don't need to be tracked along with commits. Then, when you're ready to share final files, you can remove them from `.gitignore`
- Modularizing your code into smaller scripts (clean, plot, model, etc.) can help readability, rather than keeping it all in one script
- You usually only fork a repo for yourself once, and then clone it to get a local copy once per computer
- Can update your local repo with changes from the upstream repo using `git fetch` and `git merge` (we'll do this in a later session)
- Research compendium: a way of organizing your research projects to make them more reproducible and understandable
- Example folders that can go in a compendium include `data_raw`, `data_clean`, `scripts`, `figs`, `docs`, `src`
- Lots of research compendium resources here: https://research-compendium.science/
- Modularize by splitting your long multi-step coding scripts into a set of scripts that are interrelated
- Each project folder can be a Git repo and/or an RStudio project
- Folders turned into an RStudio project will have a `.Rproj` file in them
- Blog post by Jenny Bryan on RStudio Projects and why not to use `setwd()` in your R scripts: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
- Turn off saving the workspace: Tools --> Global Options --> Basic --> Workspace section, then uncheck "Restore .RData into workspace at startup" and set "Save workspace to .RData on exit" to "Never"
- `git fetch upstream` - but upstream doesn't exist yet!
- `git remote add upstream git@github.com:cct-datascience/plant-research-compendium-ex.git` to connect the local repo with the upstream (cct-datascience) copy
- `git fetch upstream` to bring down the changes from the upstream repo
- `git log upstream/master ^master` to see commits from the upstream master that are not in the local master
- `git merge upstream/master` to merge those commits into your local master (or whichever branch you are on)
- `git push origin master` to push the local updates to origin
- There is now also a 'Sync fork' option on GitHub to sync upstream and origin, but you would still need to pull the changes down to local
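The whole upstream workflow can be rehearsed with local folders standing in for GitHub; the paths below are stand-ins for the `git@github.com:...` URLs you would use with a real fork, and all names are made up:

```shell
# "upstream": someone else's repository (a local stand-in here)
git init upstream-repo
cd upstream-repo
git config user.name "Upstream Author"
git config user.email "upstream@example.com"
echo "version 1" > README.md
git add README.md
git commit -m "Initialize compendium"
git branch -M master
cd ..

# "local": our copy; git clone names its source remote 'origin'
git clone upstream-repo local-repo
cd local-repo
git config user.name "Your Name"
git config user.email "you@example.com"

# meanwhile, upstream gains a commit we do not have yet
(cd ../upstream-repo && echo "version 2" >> README.md && git commit -am "Update README")

# connect, fetch, inspect, and merge the upstream changes
git remote add upstream ../upstream-repo
git remote -v                               # origin and upstream listed
git fetch upstream
git log upstream/master ^master --oneline   # the one commit we lack
git merge upstream/master                   # fast-forwards local master
```

With real remotes you would finish with `git push origin master` to update your own fork on GitHub.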
- `git checkout -b new-branch-name` to create a new branch
- If you find yourself copying and pasting code, could writing a custom function help?
- Function outline: `function_name <- function(input) { output_value <- do_something(input); return(output_value) }`
- To comment or uncomment multiple lines: Ctrl (or Cmd) + Shift + C
- Between curly brackets {} is plain R code
- Between parentheses () are the arguments of a function, separated by commas
- If statement outline: `if (the conditional statement is TRUE) { do something }`
- `git branch` to check the name of your branch
- `git push origin branch-name` to push your branch to your origin
- Click the link in the message to start a pull request from your branch
- Alternatively, navigate to the upstream repository on GitHub, click 'New pull request', then click 'compare across forks'
- For loops and apply: you may need these if you're copying and pasting, then changing a few arguments. Some learning curve, but much less prone to error.
- Example structure of a for loop: `for (item in list_of_items) { do_something(item) }`
- Can have multiple steps within the curly brackets
- Objects created inside a for loop are saved, but are usually overwritten on each iteration, so they reflect the most recent value
- For loops can be slow, but it is often easier to debug where an error occurs
- Loops can iterate over each value directly or over their indices
- Example of an index-based for loop: `for (i in 1:length(vector)) { do_something(vector[i]) }`
- Index-based loops are suited for pre-defining the output dimensions, which uses less memory than "growing" the vector or dataframe with each iteration; growing aligns more with the value-based method (but for smaller tasks it doesn't make a practical difference)
- Vectorized functions can run on scalars and vectors; many R operations/functions are already vectorized!
- if/else conditionals are not vectorized, so a function with an if/else conditional won't run on a vector automatically
- Such a function can be placed inside a for loop or used with `apply()`-family functions to make it work on a vector
- `install.packages('stringr')`, `library(stringr)`
- The `apply()` family of functions includes `sapply()`, `lapply()`, `tapply()`, and `mapply()`, and can vectorize a custom function without having to use a for loop
- `sapply()` takes the object you want to run the function on and the function you wish to use; output is simplified to a vector or matrix (rather than a list)
- `lapply()` does the same, but output is in a list
- `tapply()` has an additional grouping-vector argument and applies the function by group
- `mapply()` runs functions that have multiple arguments that vary (other apply functions work with multiple arguments, but only the first argument of the function can vary); the function is the first argument
- `map()` from the 'purrr' package is similar to `lapply()`, but part of the tidyverse!
- Session #9 notes on ggplot2:
- This plotting tool can only use dataframes as input
- Overall format of code to create a plot using ggplot: `ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()`
- Chaining pieces together uses a `+`, as opposed to the pipe for dplyr
- The first argument to `ggplot()` is the dataframe to plot, then `mapping` specifies what goes on which axes; at least one `geom_*` function is then required
- Determine what type of plot it is (scatterplot, histogram, etc.) with the `geom_*` function; reference list of different possible plots: https://ggplot2.tidyverse.org/reference/
- Can save a plot using the assignment operator, e.g., `surveys_plot <- ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) + geom_point()`
- We often start out creating a plot interactively before saving it as an object
- Can use pipes with ggplot, such as piping a dataframe into a ggplot call, with intermediate data cleaning steps before the plotting part (this only works with one dataframe, though): `surveys %>% ggplot(aes(x = weight, y = hindfoot_length)) + geom_point()`
- To change the appearance of all points, add arguments to `geom_*`, such as alpha (for opacity) or color
- To change the appearance of points by groups in a column, add a `color = col_name` argument to the `aes()` function
- Can choose colors using `scale_color_*` type functions
Session notes on dplyr
- If you want to filter on multiple options for a particular column, use `%in%` with a vector of options instead of `==`
- Instead of nesting functions to do multiple data cleaning steps, e.g., `filter(sort(df, -col), col > x)`, or saving an intermediate data object between each step, you can use the pipe `%>%`
- The pipe avoids having to keep track of lots of different intermediate objects and having multiple nested functions that are hard to read
- Order goes dataframe, pipe, cleaning step 1, pipe, cleaning step 2, pipe, rinse and repeat; see the example: `surveys %>% select(-day) %>% filter(month >= 7)`
- The pipe goes at the end of each line
- Keyboard shortcut for `%>%` is Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
- Save the final dataframe after the cleaning steps with the assignment operator, e.g., `surveys_sub <- surveys %>% select(-day) %>% filter(month >= 7)`
- Pay attention to the order you're doing steps in; if you change the dataframe in one step, it could affect the next step
- Use `mutate()` to make new columns; the format is `mutate(new_col_name = operations on one or more old columns)`
- The R package udunits2 is useful for converting between units (https://www.unidata.ucar.edu/software/udunits/), e.g., `udunits2::ud.convert(col, "current_unit", "new_unit")`
- Use `relocate()` to move around the order of columns; can use the `.before` and `.after` arguments to specify where columns go
- Can pipe within a dplyr function, e.g., `mutate(date = paste(year, month, day, sep = "-") %>% as.Date())`
- The split-apply-combine approach splits a dataframe up by groups and does some operation on each group
- Use the functions `group_by()` and `summarize()`; the first takes a column or columns, then `summarize()` creates a new column based on an operation, e.g., `surveys %>% group_by(species_id, sex) %>% summarize(mean_weight = mean(weight, na.rm = TRUE))`
- Use `arrange()` to sort the rows of a dataframe by a column in ascending order, or `arrange(desc())` for descending order
- Can group and then summarize while retaining all of the original dataframe columns, plus the new operation's column, by doing `group_by()` and then `mutate()` with the desired operation in the latter function
- Use `ungroup()` when you want to do some calculations on the entire dataframe after doing a grouping
- Converting dataframes between wide and long formats: recording data is usually easier in wide format, but some data tools like ggplot2 work better with long-format data
- `pivot_wider(dataframe, names_from = col_name, values_from = col_name)`, where `names_from` is the column whose values become the new column names and `values_from` is what those new columns will be populated with, e.g., `sp_by_plot_wide <- sp_by_plot %>% pivot_wider(names_from = plot_id, values_from = mean_weight)`
- `pivot_longer(dataframe, cols = col_names, names_to = "new_col_name", values_to = "values_col_name")`, where `cols` selects which columns to turn into one new column, `names_to` is the name of the new column holding the old column names, and `values_to` is the name of the new values column, e.g. (note that this excludes the species_id column instead of specifying which columns to include): `sp_by_plot_wide %>% pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")`
- Animations of pivoting under the "Tidy Data" section of this GitHub repo: https://github.com/gadenbuie/tidyexplain
- Markdown is a text formatting tool, the basis of lots of docs; it renders plain text to HTML
- Easy to edit and read, and can be tracked by version control
- RMarkdown is a tool to integrate rendered text and executed R code ('literate programming')
- Use it for more comprehensive documentation or reporting
- Even used for manuscripts and posters!
- The top of an RMarkdown file is a YAML header
- Good resource: https://bookdown.org/yihui/rmarkdown/
- Text in RMarkdown is written in Markdown and renders upon clicking 'Knit'
- Can add R code chunks by clicking the green button, or by typing Ctrl + Alt + I (Windows) or Option + Cmd + I (Mac)
- R code goes between the two sets of ```
- In the console, R code chunks use your existing working directory; when knitting, the Rmd assumes the working directory is the folder where the Rmd is saved
- tl;dr: connection errors can result from different working directories
- R code chunks have many options that can be added to the `{r}` header, e.g., `{r, echo = FALSE}`
- Chunks can be named for easier navigation in a long Rmd
- Chunk option reference: https://yihui.org/knitr/options/
- R code can be integrated into the text with inline chunks, e.g., `r length(unique(data$species_id))`
- Equations can be rendered in the text sections between $'s, e.g., `$\alpha = 0.05$`
- Visual mode renders the text sections; a good way to get feedback and improve familiarity with Markdown
- A Render tab opens when knitting, akin to a new console
- One benefit of knitting to HTML is publishing, either to RStudio Connect (UA has a license; need to register) or to RPubs
- READMEs describe the project as a whole, often as a .md (Markdown) file
- Typically saved in the root project folder as README.md
- Standard sections can differ between research compendia vs. software