diff --git a/_site.yml b/_site.yml
index 6d8e9871..99203eaa 100644
--- a/_site.yml
+++ b/_site.yml
@@ -37,4 +37,6 @@ navbar:
       href: home_precourse.html
     - text: Info
       href: home_info.html
+    - text: Projects
+      href: home_projects.html
diff --git a/data/slide_programming/Data_Information_Knowledge.png b/data/slide_programming/Data_Information_Knowledge.png
new file mode 100644
index 00000000..f9d4c696
Binary files /dev/null and b/data/slide_programming/Data_Information_Knowledge.png differ
diff --git a/data/slide_programming/Data_classification.png b/data/slide_programming/Data_classification.png
new file mode 100644
index 00000000..bdcab023
Binary files /dev/null and b/data/slide_programming/Data_classification.png differ
diff --git a/data/slide_r_environment/ggplot2_CRAN.png b/data/slide_r_environment/ggplot2_CRAN.png
new file mode 100644
index 00000000..0f0d0e20
Binary files /dev/null and b/data/slide_r_environment/ggplot2_CRAN.png differ
diff --git a/home_content.Rmd b/home_content.Rmd
index 43589f33..838c6e4a 100644
--- a/home_content.Rmd
+++ b/home_content.Rmd
@@ -37,6 +37,7 @@ This page contains links to different lectures (slides) and practical exercises
 * [Working with Vectors (Lab)](lab_vectors.html)
 * [Dataframes (Lab)](lab_dataframes.html)
 * [Loops and functions (Slides)](slide_r_elements_4.html)
+* [Loops and functions (Lab)](lab_loops.html)
 
 **Data wrangling**
 
diff --git a/home_precourse.Rmd b/home_precourse.Rmd
index c564c6ec..e03c7be1 100644
--- a/home_precourse.Rmd
+++ b/home_precourse.Rmd
@@ -72,7 +72,7 @@ Extra R packages used in the workshop exercises (if any) are listed below.
It is pkg<-unique(renv::dependencies()$Package) -pkg_discard<-c("mkteachr") +pkg_discard<-c("mkteachr", "manipulateWidget") pkg_list<-pkg[!pkg %in% pkg_discard] diff --git a/home_projects.Rmd b/home_projects.Rmd new file mode 100644 index 00000000..ffd8e8e0 --- /dev/null +++ b/home_projects.Rmd @@ -0,0 +1,224 @@ +--- +title: "Projects" +output: + bookdown::html_document2: + highlight: textmate + toc: false + toc_float: + collapsed: true + smooth_scroll: true + print: false + toc_depth: 4 + number_sections: false + df_print: default + code_folding: none + self_contained: false + keep_md: false + encoding: 'UTF-8' + css: "assets/lab.css" + include: + after_body: assets/footer-lab.html +--- + +```{r,child="assets/header-lab.Rmd"} +``` + +Hands-on analysis of actual data is the best way to learn R programming. This page contains some data sets that you can use to explore what you have learned in this course. For each data set, a brief description as well as download instructions are provided. + +
+ Try to focus on using the tools from the course to explore the data, rather than worrying about producing a perfect report with a coherent analysis workflow. +
+ + +On the last day you will present your Rmd file (or rather, the resulting html report) and share with the class what your data was about. + +--- + +## Palmer penguins 🐧 + +- This is a data set containing a series of measurements for three species of penguins collected in the Palmer station in Antarctica. +- Data description: + +
+ Download instructions +```{r, warning=F, message=F} +penguins <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/heplots/peng.csv", header = T, sep = ",") +str(penguins) +``` +
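Once the data is loaded, a first per-group summary could look like the sketch below. The column names here (`species`, `body_mass`) are assumptions, so check `str(penguins)` for the real ones; a small toy data frame stands in for the downloaded table so the snippet runs on its own.

```r
# Toy stand-in for the penguins table; the column names are assumptions,
# inspect str(penguins) for the real ones before adapting this.
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass = c(3700, 3800, 5000, 5100)
)
# mean body mass per species, using base R only
res <- aggregate(body_mass ~ species, data = toy, FUN = mean)
res
```

`aggregate()` with a formula is the base-R counterpart of a tidyverse `group_by()` + `summarise()` pipeline.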

+
+---
+
+## Drinking habits 🍷
+
+- Data from a national survey on the drinking habits of American citizens in 2001 and 2002.
+- Data description: 
+
+
+ Download instructions +```{r} +library(dplyr) +# this will download the csv file directly from the web +drinks <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/nesarc_drinkspd.csv", header = T, sep = ",") +# the lines below will take a sample from the full data set +set.seed(seed = 2) +drinks <- sample_n(drinks, size = 3000, replace = F) +# and here we check the structure of the data +str(drinks) +``` +
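A note on the sampling step above: `set.seed()` is what makes `sample_n()` reproducible, so everyone in the class works on the same 3000 rows. The same idea in base R, sketched on a toy data frame standing in for the downloaded survey:

```r
# base-R equivalent of dplyr::sample_n(): index rows with sample();
# the toy frame is a stand-in for the real data
toy <- data.frame(id = 1:100)
set.seed(2)
sub1 <- toy[sample(nrow(toy), size = 10), , drop = FALSE]
set.seed(2)
sub2 <- toy[sample(nrow(toy), size = 10), , drop = FALSE]
identical(sub1, sub2)  # TRUE: same seed, same sample
```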
+ +--- + +## Car crashes 🚗 + +- Data from car accidents in the US between 1997-2002. +- Data description: + +
+ Download instructions +```{r} +library(dplyr) +# this will download the csv file directly from the web +crashes <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/nassCDS.csv", header = T, sep = ",") +# the lines below will take a sample from the full data set +set.seed(seed = 2) +crashes <- sample_n(crashes, size = 3000, replace = F) +# and here we check the structure of the data +str(crashes) +``` +
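For categorical columns like the ones in this data set, `table()` gives quick cross-tabulated counts. A sketch on a toy frame; the real column names are an assumption, so check `str(crashes)` first:

```r
# cross-tabulating two categorical variables with base R;
# toy columns stand in for the downloaded data
toy <- data.frame(
  airbag = c("airbag", "none", "airbag", "none"),
  dead   = c("alive", "dead", "alive", "alive")
)
tab <- table(toy$airbag, toy$dead)
tab
```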
+ +--- + +## Gapminder health and wealth 📈 + +- This is a collection of country indicators from the Gapminder dataset for the years 2000-2016. +- Data description: + +
+ Download instructions +```{r} +library(dplyr) +# this will download the csv file directly from the web +gapminder <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/dslabs/gapminder.csv", header = T, sep = ",") +# here we filter the data to remove anything before the year 2000 +gapminder <- gapminder |> filter(year >= 2000) +# and here we check the structure of the data +str(gapminder) +``` +
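The `filter(year >= 2000)` step above has a base-R equivalent with `subset()`, sketched here on a toy frame standing in for the downloaded data:

```r
# base-R equivalent of dplyr::filter(): subset() with a logical condition
toy <- data.frame(country = c("A", "B", "C"), year = c(1998, 2000, 2005))
recent <- subset(toy, year >= 2000)
nrow(recent)  # 2 rows remain
```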
+ +--- + +## StackOverflow survey 🖥️ + +- This is a downsampled and modified version of one of StackOverflow's annual surveys where users respond to a series of questions related to careers in technology and coding. +- Data description: + +
+ Download instructions +```{r} +library(dplyr) +# this will download the csv file directly from the web +stackoverflow <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/stackoverflow.csv", header = T, sep = ",") +# the lines below will take a sample from the full data set +set.seed(2) +stackoverflow <- sample_n(stackoverflow, size = 3000) +# and here we check the structure of the data +str(stackoverflow) +``` +
+ +--- + +## Doctor visits 🤒 + +- Data on the frequency of doctor visits in the past two weeks in Australia for the years 1977 and 1978. +- Data description: + +
+ Download instructions +```{r} +library(dplyr) +# this will download the csv file directly from the web +doctor <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv", header = T, sep = ",") +# the lines below will take a sample from the full data set +set.seed(2) +doctor <- sample_n(doctor, size = 3000) +# and here we check the structure of the data +str(doctor) +``` +

+
+---
+
+## Video Game Sales 🎮
+
+- This data set contains sales figures for video game titles released in 2001 and 2002.
+- Data description: 
+  - Click on "Preview Data" and "VG Data Dictionary" to see the description for each column.
+
+
+ Download instructions +```{r, warning=F, message=F} +library(dplyr) +library(lubridate) +# this will download the file to your working directory +download.file(url = "https://maven-datasets.s3.amazonaws.com/Video+Game+Sales/Video+Game+Sales.zip", destfile = "video_game_sales.zip") +# this will unzip the file and read it into R +videogames <- read.table(unz(filename = "vgchartz-2024.csv", "video_game_sales.zip"), header = T, sep = ",", quote = "\"", fill = T) +# this will select rows corresponding to years 2001 and 2002 +videogames <- filter(videogames, year(as_date(release_date)) %in% c(2001,2002)) +# and here we check the structure of the data +str(videogames) +``` +
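The `year(as_date(...))` step above can also be done in base R; the ISO date format used here is an assumption about the file's `release_date` column, so check a few rows first:

```r
# extract the year from ISO-formatted date strings without lubridate
dates <- as.Date(c("2001-03-15", "2002-07-01", "1999-12-31"))
years <- as.integer(format(dates, "%Y"))
years %in% c(2001, 2002)  # TRUE TRUE FALSE
```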

+
+---
+
+## LEGO Sets 🏗️
+
+- This data set contains the description of all LEGO sets released from 2000 to 2009.
+- Data description: 
+  - Click on "Preview Data" and "Data Dictionary" to see the description for each column.
+
+
+ Download instructions +```{r, warning=F, message=F} +library(dplyr) +# this will download the file to your working directory +download.file(url = "https://maven-datasets.s3.amazonaws.com/LEGO+Sets/LEGO+Sets.zip", destfile = "lego.csv.zip") +# this will unzip the file and read it into R +lego <- read.table(unz(filename = "lego_sets.csv", "lego.csv.zip"), header = T, sep = ",", quote = "\"", fill = T) +# this will select rows corresponding to years 2000-2009 +lego <- filter(lego, year %in% seq(2000,2009,1)) +# and here we check the structure of the data +str(lego) +``` +

+
+---
+
+## Shark attacks 🦈
+
+- This data set contains information on shark attack records from all over the world.
+- Data description: 
+  - Click on "Preview Data" and "Data Dictionary" to see the description for each column.
+
+
+ Download instructions +```{r, warning=F, message=F} +library(dplyr) +# this will download the file to your working directory +download.file(url = "https://maven-datasets.s3.amazonaws.com/Shark+Attacks/attacks.csv.zip", destfile = "attacks.csv.zip") +# this will unzip the file and read it into R +sharks <- read.table(unz(filename = "attacks.csv", "attacks.csv.zip"), header = T, sep = ",", quote = "\"", fill = T) +# the lines below will take a sample from the full data set +set.seed(seed = 2) +sharks <- sample_n(sharks, size = 3000, replace = F) +str(sharks) +``` +
+ +*** diff --git a/schedule.csv b/schedule.csv index 0e0dd773..8f6fb454 100644 --- a/schedule.csv +++ b/schedule.csv @@ -1,29 +1,37 @@ date;room;start_time;end_time;topic;teacher;assistant;link_slide;link_lab;link_room -23/10/2023;Tripplet room;09:00;09:15;Welcome;Nima;NR, PA;;; -;;09:15;09:30;Intro to R;Nima;NR, PA;slide_r_intro.html;; -;;09:30;10:00;Intro to R programming;Nima;NR, PA;slide_r_programming_1.html;; -;;10:15;10:45;Intro to R environment;Nima;NR, PA;slide_r_environment.html;; -;;11:00;12:00;Using Rstudio;Nima;NR, PA;;https://www.dropbox.com/s/3sy4ou2o8jh5syf/RCourseVideo.mov?dl=0; +28/10/2024;Experimental room;09:00;09:15;Welcome;Nima;;;; +;;09:15;10:00;Introduction;Nima;;slide_r_intro.html;; +;;10:00;11:00;Using Rstudio;Nima;;;https://youtu.be/suX6nsSUXDw?si=Vs1e22GU6UJ4Ty7u; +;;11:00;12:00;Essential: Variable & operators;Nima;;slide_r_elements_1.html;; ;;12:00;13:00;Lunch;;;;; -;;13:00;15:00;Variables & Operators;Nima;NR, PA, GD;slide_r_elements_1.html;; -;;15:00;17:00;Data types;Nima;NR, PA, GD;;lab_datatypes.html; -24/10/2023;Tripplet room;09:00;10:00;Vectors & Strings;Sebastian DiLorenzo;NR, PA, SD, GD;slide_r_elements_2.html;; -;;10:00;11:00;Matrices, Lists and Dataframes;Prasoon;NR, PA, SD, GD;slide_r_elements_3.html;; -;;11:00;12:00;Working with Vectors;Sebastian DiLorenzo;NR, PA, SD, GD;;lab_vectors.html; +;;13:00;13:15;Projects and group discussion;Guilherme;;;; +;;13:15;14:00;Essential: data types;Guilherme;;;lab_datatypes.html; +;;14:00;15:00;Essential: Vectors & Strings;Guilherme;;slide_r_elements_2.html;; +;;15:00;16:00;Essential: Working with Vectors;Guilherme;;;lab_vectors.html; +;;16:00;17:00;Group discussion on projects;;;;; +29/10/2024;Experimental room;09:00;10:00;Essential: Matrices, Lists and Dataframes;Guilherme;;slide_r_elements_3.html;; +;;10:00;11:00;Essential: Matrices, Lists and Dataframes;Guilherme;;;lab_dataframes.html; +;;11:00;12:00;Loading data into R;Guilherme;;slide_loading_data.html;; ;;12:00;13:00;Lunch;;;;; 
-;;13:00;17:00;Working with Matrices, Lists and Dataframes;Prasoon;NR, PA, SD;;lab_dataframes.html; -25/10/2023;Tripplet room;09:00;10:00;Loading data into R;Sebastian DiLorenzo;NR, PA, GD, SD;slide_loading_data.html;; -;;10:00;12:00;Loading data into R;Sebastian DiLorenzo;NR, PA, GD, SD;;lab_loadingdata.html; +;;13:00;15:00;Loading data into R;Guilherme;;;lab_loadingdata.html; +;;15:00;15:30;Essential: Basic statistics;Nima;;slide_r_basic_statistic.html;; +;;15:30;16:00;Essential: Basic statistics;Nima;;;; +;;16:00;17:00;Group discussion on projects;;;;; +30/10/2024;Experimental room;09:00;10:00;Essential: Loops, Conditionals, Functions;Miguel;;slide_r_elements_4.html;; +;;10:00;12:00;Essential: Loops, Conditionals, Functions;Miguel;;;lab_loops.html; ;;12:00;13:00;Lunch;;;;; -;;13:00;14:00;Control Structures, Iteration;Nima;NR, PA, GD;slide_r_elements_4.html;; -;;14:00;17:00;Loops, Conditionals, Functions;Nima;NR, PA, GD;;lab_loops.html; -26/10/2023;Tripplet room;09:00;10:00;Base graphics;Prasoon;PA, NR, GD;slide_base_graphics.html;; -;;10:00;12:00;Base graphics;Prasoon;PA, NR, GD;;lab_graphics.html; +;;13:00;14:00;Intro to Tidyverse;Marcin;;slide_tidyverse.html;; +;;14:00;16:00;Intro to Tidyverse;Marcin;;;lab_tidyverse.html; +;;16:00;17:00;Group discussion on projects;;;;; +31/10/2024;Experimental room;09:00;10:00;Base graphics;Nima;;slide_base_graphics.html;; +;;10:00;12:00;Base graphics;Nima;;;lab_graphics.html; ;;12:00;13:00;Lunch;;;;; -;;13:00;14:00;Intro to Tidyverse;Marcin Kierczak;MK, PA, GD, NR;slide_tidyverse.html;; -;;14:00;17:00;Intro to Tidyverse;Marcin Kierczak;MK, PA, GD, NR;;lab_tidyverse.html; -27/10/2023;Tripplet room;09:00;10:00;Graphics using ggplot2;Prasoon;PA, NR;slide_ggplot2.html;; -;;10:00;11:00;Topic of your interest;Nima/Prasoon;NR, PA;;; -;;11:00;12:00;Q&A;Nima/Prasoon;NR, PA, MR;;; +;;13:00;14:00;Graphics using ggplot2;Lokesh;;slide_ggplot2.html;; +;;14:00;16:00;Working with ggplot2;Lokesh;;;lab_ggplot2.html; +;;16:00;17:00;Group 
discussion on projects;;;;;
+1/11/2024;Experimental room;09:00;10:00;Group discussion on projects;;;;;
+;;10:00;12:00;Group discussion on projects;;;;;
 ;;12:00;13:00;Lunch;;;;;
-;;13:00;16:00;Working with ggplot2;Prasoon;PA, NR, MR;;lab_ggplot2.html;
\ No newline at end of file
+;;13:00;14:30;Group presentation;;;;;
+;;14:30;15:00;Q & A;;;;;
\ No newline at end of file
diff --git a/slide_loading_data.Rmd b/slide_loading_data.Rmd
index c9c7bdd0..c492ae14 100644
--- a/slide_loading_data.Rmd
+++ b/slide_loading_data.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Reading (and writing) data in R"
-subtitle: "R Foundations for Life Scientists"
+subtitle: "R Foundations for Data Analysis"
 author: "Marcin Kierczak"
 keywords: "bioinformatics, course, scilifelab, nbis, R"
 output:
@@ -46,15 +46,9 @@ name: reading_data
 
 # Reading data
 
---
-
-* Reading data is one of the most consuming and most cumbersome aspects of bioinformatics...
-
---
-
-* R provides a number of ways to read and write data stored on different media (file, database, url, twitter, Facebook, etc.) and in different formats.
+* Can be one of the most time-consuming and cumbersome aspects of data analysis.
 
---
+* R provides ways to read and write data stored on different media (e.g. file, database, URL) and in different formats.
 
 * Package `foreign` contains a number of functions to import less common data formats.
 
@@ -63,11 +57,11 @@ name: reading_tables
 
 # Reading tables
 
-Most often, we will use the `read.table()` function. It is really, really flexible and nice way to read your data into a data.frame structure with rows corresponding to observations and columns to particular variables.
+We can use the `read.table()` function. It is a nice way to read your data into a data frame.
 
The function is declared in the following way: -``` +```{r, echo=T, eval=F} read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, @@ -77,28 +71,33 @@ read.table(file, header = FALSE, sep = "", quote = "\"'", comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), - fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)* + fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) + +# or just +read.table(file) ``` --- name: read_table_params -# `read.table` parameters +# `read.table()` parameters + +You can read all about the `read.table()` function using `?read.table` -You can read more about the *read.table* function on its man page, but the most important arguments are: +The most important arguments are: -* file – the path to the file that contains data, -* header – a logical indicating whether the first line of the file contains variable names, -* sep – a character determining variable delimiter, e.g. comma for csv files, -* quote – a character telling R which character surrounds strings, -* dec – acharacter determining the decimal separator, -* row/col.names – vectors containing row and column names, -* na.strings – a character used for missing data, -* nrows – how many rows should be read, -* skip – how many rows to skip, -* as.is – a vector of logicals or numbers indicating which columns shall not be converted to factors, -* fill – add NA to the end of shorter rows, -* stringsAsFactors – a logical. Rather self explanatory. +* **file** – the path to the file that contains data, e.g. `/path/to/my/file.csv` +* **header** – a logical indicating whether the first line of the file contains variable names, +* **sep** – a character determining variable delimiter, e.g. 
`","` for csv files, +* **quote** – a character telling R which character surrounds strings, +* **dec** – character determining the decimal separator, +* **row/col.names** – vectors containing row and column names, +* **na.strings** – a character used for missing data, +* **nrows** – how many rows should be read, +* **skip** – how many rows to skip, +* **as.is** – a vector of logicals or numbers indicating which columns shall not be converted to factors, +* **fill** – add NA to the end of shorter rows, +* **stringsAsFactors** – a logical. Rather self explanatory. --- name: read_table_sibs @@ -130,37 +129,25 @@ name: handling_errors # What if you encounter errors? -* StackOverflow, +* R documentation `?` and `??` * Google – just type R and copy the error you got without your variable names, -* open the file – has the header line the same number of columns as the first line? -* in Terminal (on Linux/OsX) you can type some useful commands. +* Open the file using a text editor and see if you can spot anything unusual – + * e.g. has the header line the same number of columns as the first line? -- -# Useful commands for debugging - --- +# Useful terminal commands for debugging (Linux/OsX) * `cat phenos.txt | awk -F';' '{print NF}'` prints the number of words in each row. `-F';'` says that semicolon is the delimiter, --- - * `head -n 5 phenos.txt` prints the 5 first lines of the file, --- - * `tail -n 5 phenos.txt` prints the 5 last lines of the file, --- - * `head -n 5 phenos.txt | tail -n 2` will print lines 4 and 5... --- - * `wc -l phenos.txt` will print the number of lines in the file --- - * `head -n 2 phenos.txt > test.txt` will write the first 2 lines to a new file -- @@ -176,19 +163,21 @@ name: writing # Writing with `write.table()` -`read.table()` has its counterpart, the `write.table()` function (as well ass its siblings, like write.csv()). 
You can read more about it in the documentation, let us show some examples:
+`read.table()` has its counterpart, the `write.table()` function (as well as its siblings, like `write.csv()`). You can read more about it in the documentation; let us show some examples:
 
 ```{r write.table, echo=T, eval=F}
 vec <- rnorm(10)
 write.table(vec, '') # write to screen
 write.table(vec, file = 'vector.txt')
+
 # write to the system clipboard, handy!
-write.table(vec, 'clipboard', col.names=F,
-            row.names=F)
+write.table(vec, 'clipboard', col.names=F, row.names=F)
+
 # or on OsX
 clip <- pipe("pbcopy", "w")
 write.table(vec, file=clip)
 close(clip)
+
 # To use in a spreadsheet
 write.csv(vec, file = 'spreadsheet.csv')
 ```
@@ -213,12 +202,12 @@ name: read_xls_matlab
 
 ```{r xls, eval=F, echo=T}
 library(readxl)
-data <- readxl::read_xlsx('myfile.xlsx')
+data <- read_xlsx('myfile.xlsx')
 ```
 
 ```{r matlab, eval=F, echo=T}
 library(R.matlab)
-data <- R.matlab::readMat("mydata.mat")
+data <- readMat("mydata.mat")
 ```
 
---
diff --git a/slide_r_basic_statistic.Rmd b/slide_r_basic_statistic.Rmd
new file mode 100644
index 00000000..3459694e
--- /dev/null
+++ b/slide_r_basic_statistic.Rmd
@@ -0,0 +1,428 @@
+---
+title: "Brief introduction to statistics"
+subtitle: "Statistics"
+author: "Nima Rafati"
+keywords: bioinformatics, course, scilifelab, nbis, R
+output:
+  xaringan::moon_reader:
+    encoding: 'UTF-8'
+    self_contained: false
+    chakra: 'assets/remark-latest.min.js'
+    css: 'assets/slide.css'
+    lib_dir: libs
+    include: NULL
+    nature:
+      ratio: '4:3'
+      highlightLanguage: r
+      highlightStyle: github
+      highlightLines: true
+      countIncrementalSlides: false
+      slideNumberFormat: "%current%/%total%"
+---
+
+exclude: true
+count: false
+
+```{r,echo=FALSE,child="assets/header-slide.Rmd"}
+```
+
+
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo=TRUE, width=60)
+```
+
+```{r,include=FALSE}
+# load the packages you need
+#library(dplyr)
+#library(tidyr)
+#library(stringr)
+#library(ggplot2)
+#library(mkteachr) +``` + +--- +name: intro + +# Introduction + +**Why do we need statistics in our analysis?** + +- Make data understandable and insightful. + +- Evaluate patterns and trends. + +- Identify and quantify differences/similarities between groups. + + +-- + + +**Types of statistics:** + +- Descriptive statistics: To summarize and describe main features of a dataset (Mean, median,...). + +- Inferential statistics: To make prediction or inferences about a population using a sample of data (Hypothesis testing, regression analysis,...). + +- Predictive statistics: To make predictions about future outcomes based on collected data (Regression models, time series forecasting, machine learning,...). + +- ...... + + +--- +name: Descriptive +# Types of Descriptive Statistics + +Descriptive statistics helps to: + +- Summarize and describe the data. + +- Visualize the data. + +- Identify patterns (trends) and outliers in the data. + +- Provide insights for downstream-analysis. + +--- +name: SomeStats +# Some of the basic descriptive statistics + +1. **Measures of Central Tendency** + - Mean, Median, Mode. +2. **Measures of Spread** + - Range, Interquartile Range, Standard Deviation, Variance. +3. **Correlation** + - Relation between two variables (e.g. Pearson's correlation). + +--- +name: Mean +# Central Tendency: Mean +- Mean: The average value of data. 
+
+$$
+\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
+$$
+
+```{r Mean, eval = T, echo = F, fig.width = 10, fig.height=4}
+set.seed(123)
+par(mfrow = c(1, 2), mar = c(5, 4, 4, 2) + 0.1)
+data <- data.frame( var1 = rgamma(10000, shape = 2, scale = 2) * 12,
+                    var2 = rnorm(10000, mean = 100, sd = 20))
+hist(data$var1,breaks = 50, main = 'var1 distribution', xlab = 'var1', col = 'skyblue', freq = TRUE)
+var1_mean = mean(data$var1)
+# Mean
+abline(v = var1_mean, col = 'red', lwd = 2)
+text(x = var1_mean + 10 , y = 700, labels = paste("Mean =", round(var1_mean, 2)), pos = 4, col = 'red', cex = 0.8)
+
+hist(data$var2,breaks = 50, main = 'var2 distribution', xlab = 'var2', col = 'skyblue', freq = TRUE)
+var2_mean = mean(data$var2)
+var2_median = median(data$var2)
+# Mean
+abline(v = var2_mean, col = 'red', lwd = 2)
+text(x = var2_mean + 10 , y = 700, labels = paste("Mean =", round(var2_mean, 2)), pos = 4, col = 'red', cex = 0.8)
+```
+
+```{r mean, eval = T, echo = T}
+mean(data$var1)
+mean(data$var2)
+```
+
+
+---
+name: Median
+
+# Central Tendency: Median
+
+- Median: The middle value when the data is sorted.
+```{r Median, eval = T, echo = F, fig.width = 10, fig.height=5} +par(mfrow=c(1,2)) +hist(data$var1,breaks = 50, main = 'var1 distribution', xlab = 'var1', col = 'skyblue', freq = TRUE) +var1_mean = mean(data$var1) +var1_median = median(data$var1) +# Mean +abline(v = var1_mean, col = 'red', lwd = 2) +text(x = var1_mean + 10 , y = 400, labels = paste("Mean =", round(var1_mean, 2)), pos = 4, col = 'red') +# Median +abline(v = var1_median, col = 'green', lwd = 2) +text(x = var1_median + 10 , y = 500, labels = paste("Median =", round(var1_median, 2)), pos = 4, col = 'green') + +hist(data$var2,breaks = 50, main = 'var2 distribution', xlab = 'var2', col = 'skyblue', freq = TRUE) +var2_mean = mean(data$var2) +var2_median = median(data$var2) +# Mean +abline(v = var2_mean, col = 'red', lwd = 2) +text(x = var2_mean + 30 , y = 400, labels = paste("Mean =", round(var2_mean, 2)), pos = 4, col = 'red') +# Median +abline(v = var2_median, col = 'green', lwd = 2) +text(x = var2_median + 30 , y = 600, labels = paste("Median =", round(var2_median, 2)), pos = 4, col = 'green') +``` + +```{r} +median(data$var1) +median(data$var2) +``` + +--- +name: Mode +# Central Tendency: Mode + +- Mode: The most frequently occurring value. 
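Note that base R has no built-in function for the statistical mode (`mode()` reports an object's storage mode instead), so for discrete data a small helper is needed; a sketch:

```r
# statistical mode for discrete data: the most frequent value
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(c(1, 2, 2, 3, 3, 3))  # 3
```

For continuous data, a common alternative is to take the peak of a density estimate instead of counting exact values.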
+
+```{r Mode-plot, eval = T, echo = F, fig.width = 10, fig.height=5}
+par(mfrow=c(1,2))
+hist(data$var1,breaks = 50, main = 'var1 distribution', xlab = 'var1', col = 'skyblue', freq = TRUE)
+var1_mean = mean(data$var1)
+var1_median = median(data$var1)
+# Mean
+abline(v = var1_mean, col = 'red', lwd = 2)
+text(x = var1_mean + 10 , y = 400, labels = paste("Mean =", round(var1_mean, 2)), pos = 4, col = 'red')
+# Median
+abline(v = var1_median, col = 'green', lwd = 2)
+text(x = var1_median + 10 , y = 500, labels = paste("Median =", round(var1_median, 2)), pos = 4, col = 'green')
+# Mode
+density_data <- density(data$var1)
+var1_mode <- density_data$x[which.max(density_data$y)]
+abline(v = var1_mode, col = 'purple', lwd = 2)
+text(x = var1_mode + 10 , y = 600, labels = paste("Mode =", round(var1_mode, 2)), pos = 4, col = 'purple')
+
+
+hist(data$var2,breaks = 50, main = 'var2 distribution', xlab = 'var2', col = 'skyblue', freq = TRUE)
+var2_mean = mean(data$var2)
+var2_median = median(data$var2)
+# Mean
+abline(v = var2_mean, col = 'red', lwd = 2)
+text(x = var2_mean + 30 , y = 400, labels = paste("Mean =", round(var2_mean, 2)), pos = 4, col = 'red')
+# Median
+abline(v = var2_median, col = 'green', lwd = 2)
+text(x = var2_median + 30 , y = 600, labels = paste("Median =", round(var2_median, 2)), pos = 4, col = 'green')
+# Mode
+density_data <- density(data$var2)
+var2_mode <- density_data$x[which.max(density_data$y)]
+abline(v = var2_mode, col = 'purple', lwd = 2)
+text(x = var2_mode - 90 , y = 600, labels = paste("Mode =", round(var2_mode, 2)), pos = 4, col = 'purple')
+```
+
+```{r mode, echo = T, eval = T}
+# note: mode() would return the storage mode ("numeric"), not the
+# statistical mode, so we take the peak of the density estimate instead
+density_data <- density(data$var1)
+density_data$x[which.max(density_data$y)]
+density_data <- density(data$var2)
+density_data$x[which.max(density_data$y)]
+```
+---
+name: Spread
+# Measures of spread: Range and Interquartile Range
+- Range: Difference between maximum `max(data$var2)` and minimum `min(data$var2)`.
+- Interquartile Range: Data is represented in four equally sized groups (bins) known as **Quartile** and the distance between quartile is called **Interquartile Range** (IQR). + +```{r range, echo = F, eval = T} +# Sample data +set.seed(123) +data_quartile <- c(24, 30, 33, 45, 47, 58, 60, 66, 70) + +# Calculate min, Q1, Q2 (median), Q3, max, IQR, and range +min_val <- min(data_quartile) +q1 <- quantile(data_quartile, 0.25) +median_val <- median(data_quartile) +q3 <- quantile(data_quartile, 0.75) +max_val <- max(data_quartile) +iqr_val <- IQR(data_quartile) +range_val <- max_val - min_val + +# Plot the main line and quartiles +plot(c(1, 9), c(0, 1), type = "n", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n") + +# Main line (the range of the data) +segments(1, 0.5, 9, 0.5, lwd = 2) + +# Draw vertical lines at min, Q1, median (Q2), Q3, max +segments(1, 0.45, 1, 0.55, lwd = 2) # Min +segments(3, 0.45, 3, 0.55, lwd = 2, col = "orange") # Q1 +segments(5, 0.45, 5, 0.55, lwd = 2, col = "red") # Q2 (Median) +segments(7, 0.45, 7, 0.55, lwd = 2, col = "orange") # Q3 +segments(9, 0.45, 9, 0.55, lwd = 2) # Max + +# Add the values on top +text(1, 0.6, min_val, cex = 1) +text(3, 0.6, q1, cex = 1) +text(5, 0.6, median_val, cex = 1, col = "red") +text(7, 0.6, q3, cex = 1) +text(9, 0.6, max_val, cex = 1) + +# Add labels for Min, Q1, Q2, Q3, Max +text(1, 0.4, "Min", cex = 1, col = "blue") +text(3, 0.4, "Q1", cex = 1, col = "blue") +text(5, 0.4, "Q2", cex = 1, col = "blue") +text(7, 0.4, "Q3", cex = 1, col = "blue") +text(9, 0.4, "Max", cex = 1, col = "blue") + +# Add the IQR and Range arrows and labels +arrows(3, 0.3, 7, 0.3, length = 0.1) +text(5, 0.25, paste("IQR = Q3 - Q1 =", round(iqr_val, 2)), cex = 1) + +arrows(1, 0.2, 9, 0.2, length = 0.1) +text(5, 0.15, paste("Range = Max - Min =", range_val), cex = 1) + +``` + +--- +name: Variance +# Measures of spread: Variance + +- Variance: How far the data points are spread out from the mean. 
Unit is the square of the data's unit (e.g. $cm^2$).
+
+$$
+\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
+$$
+```{r var, echo=TRUE, eval=TRUE}
+var(data$var2)
+```
+---
+name: Stdev
+# Measures of spread: Standard deviation
+
+- Standard deviation (sd): the square root of the variance, which provides a more intuitive measure of spread. Unlike the variance, sd has the same unit as the data (e.g. cm).
+
+$$
+\sigma = \sqrt{\sigma^2}
+$$
+
+```{r sd-plot,echo = F, eval = T}
+var2_sd <- sd(data$var2)
+hist(data$var2,breaks = 50, main = 'var2 distribution', xlab = 'var2', col = 'skyblue', freq = TRUE, ylim = c(0,1200))
+abline(v = var2_mean, col = 'red', lwd = 2)
+rect(var2_mean - var2_sd, 0, var2_mean + var2_sd, 1100, col = rgb(0.9, 0.9, 0.9, 0.5), border = NA)
+rect(var2_mean - 2*var2_sd, 0, var2_mean - var2_sd, 1100, col = rgb(0.7, 0.7, 0.7, 0.5), border = NA)
+rect(var2_mean + 2*var2_sd, 0, var2_mean + var2_sd, 1100, col = rgb(0.7, 0.7, 0.7, 0.5), border = NA)
+
+rect(var2_mean - 3*var2_sd, 0, var2_mean - 2*var2_sd, 1100, col = rgb(0.5, 0.5, 0.5, 0.5), border = NA)
+rect(var2_mean + 3*var2_sd, 0, var2_mean + 2*var2_sd, 1100, col = rgb(0.5, 0.5, 0.5, 0.5), border = NA)
+
+text(x = var2_mean - 1 , y = 1200, labels = expression(bar(x)), pos = 4, col = 'red', cex = 0.8)
+text(x = var2_mean + 5, y = 1100, labels = expression(bar(x) + sd), pos = 4, col = 'black', cex = 0.8)
+text(x = var2_mean - var2_sd , y = 1100, labels = expression(bar(x) - sd), pos = 4, col = 'black', cex = 0.8)
+
+text(x = var2_mean + 2*var2_sd - 15 , y = 1100, labels = expression(bar(x) + 2*sd), pos = 4, col = 'black', cex = 0.7)
+text(x = var2_mean - 2*var2_sd , y = 1100, labels = expression(bar(x) - 2*sd), pos = 4, col = 'black', cex = 0.7)
+
+text(x = var2_mean + 3*var2_sd - 15 , y = 1100, labels = expression(bar(x) + 3*sd), pos = 4, col = 'black', cex = 0.7)
+text(x = var2_mean - 3*var2_sd , y = 1100, labels = expression(bar(x) - 3*sd), pos = 4, col = 'black', cex = 0.7)
+
+```
+---
+name: correlation +# Correlation + +- Measuring the strength and direction of the **linear** relationship between two variables. + + - Positive Correlation: As one variable increases, the other also increases. + + - Negative Correlation: As one variable increases, the other decreases. + + - No Correlation: No directional relationship between the variables. + +--- +name: Pearson +# Types of correlation +- Pearson's correlation coefficient: Correlation of two **continuous** variables. +- Assumptions: + - Linear relationship. + - Normally distributed variables. + +```{r pearson,echo = F, eval = T, fig.width=10, fig.height=5} +set.seed(123) + +# Generate data for perfect positive correlation +x_pos <- seq(1, 100, length.out = 100) +y_pos <- x_pos + rnorm(100, mean = 0, sd = 1) # adding a tiny bit of noise for realism + +# Generate data for perfect negative correlation +x_neg <- seq(1, 100, length.out = 100) +y_neg <- -x_neg + rnorm(100, mean = 0, sd = 1) + +# Generate data for no correlation +x_none <- seq(1, 100, length.out = 100) +y_none <- rnorm(100, mean = 50, sd = 20) + +# Combine all datasets into a data frame +data <- data.frame( + x_pos = x_pos, + y_pos = y_pos, + x_neg = x_neg, + y_neg = y_neg, + x_none = x_none, + y_none = y_none +) + +# Plot the data to visualize the correlations +par(mfrow = c(1, 3), mar = c(5, 4, 4, 5) + 0.1) + +# Positive correlation +plot(data$x_pos, data$y_pos, main = paste0("Positive (r =", round(cor(data$x_pos, data$y_pos), digits = 4), ")"), xlab = "X", ylab = "Y", col = "blue", pch = 19) +abline(lm(data$y_pos ~ data$x_pos), col = "red", lwd = 2) + +# Negative correlation +plot(data$x_neg, data$y_neg, main = paste0("Negative (r =", round(cor(data$x_neg, data$y_neg), digits = 4), ")"), xlab = "X", ylab = "Y", col = "blue", pch = 19) +abline(lm(data$y_neg ~ data$x_neg), col = "red", lwd = 2) + +# No correlation +plot(data$x_none, data$y_none, main = paste0("No Correlation (r =", round(cor(data$x_none, data$y_none), digits = 2), ")"), 
xlab = "X", ylab = "Y", col = "blue", pch = 19)
+abline(lm(data$y_none ~ data$x_none), col = "red", lwd = 2)
+
+```
+
+$$
+r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
+$$
+
+
+---
+name: Spearman
+# Types of correlation
+- Spearman's rank correlation coefficient: Measures the monotonic relationship between two **ranked** variables.
+- Assumptions:
+  - It is a non-parametric approach and does not require a linear relationship.
+  - The data need not be normally distributed.
+  - It works for both continuous and ordinal (categorical) variables.
+```{r spearman,echo = F, eval = T, fig.width=8, fig.height=4}
+# Create the ordinal dataset
+data_ordinal <- data.frame(
+  Satisfaction = c(5, 4, 3, 2, 1, 4, 5, 2, 3, 1),
+  Performance = c(9, 8, 7, 3, 2, 6, 10, 1, 5, 4)
+)
+
+# Calculate Spearman's rank correlation
+spearman_corr <- cor(data_ordinal$Satisfaction, data_ordinal$Performance, method = "spearman")
+
+# Plot to visualize the relationship
+plot(data_ordinal$Satisfaction, data_ordinal$Performance,
+     xlab = "Satisfaction (Ordinal)",
+     ylab = "Performance (Rank)",
+     main = paste("Spearman's Correlation =", round(spearman_corr, 2)),
+     pch = 19, col = "blue")
+
+# Add a line to show the trend
+abline(lm(data_ordinal$Performance ~ data_ordinal$Satisfaction), col = "red", lwd = 2)
+
+```
+
+$$
+\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
+$$
+---
+name: closing
+# More on statistics?
+- We discussed only very basic descriptive statistical measures.
+- You can read more [here](https://nbisweden.github.io/workshop-mlbiostatistics/session-descriptive/docs/index.html).
+
+
+---
+name: end_slide
+class: end-slide, middle
+count: false
+
+# See you at the next lecture!
+```{r, echo=FALSE,child="assets/footer-slide.Rmd"}
+```
+
+```{r,include=FALSE,eval=FALSE}
+# manually run this to render this document to HTML
+#rmarkdown::render("presentation_demo.Rmd")
+# manually run this to convert HTML to PDF
+#pagedown::chrome_print("presentation_demo.html",output="presentation_demo.pdf")
+```
diff --git a/slide_r_elements_1.Rmd b/slide_r_elements_1.Rmd
index c3a36ee4..0c8696c8 100644
--- a/slide_r_elements_1.Rmd
+++ b/slide_r_elements_1.Rmd
@@ -1,7 +1,7 @@
 ---
 title: "Variables, Data types & Operators"
 subtitle: "Elements of the R language"
-author: "Marcin Kierczak"
+author: "Marcin Kierczak & Nima Rafati"
 keywords: bioinformatics, course, scilifelab, nbis, R
 output:
   xaringan::moon_reader:
@@ -210,7 +210,7 @@ class(x)
is.integer(x)
```

-> We need **casting** because sometimes a function requires data of some type!
+> We need **casting** because sometimes a function requires data of a certain type!

---
name: casting2
diff --git a/slide_r_intro.Rmd b/slide_r_intro.Rmd
index e9ba2404..baa159d8 100644
--- a/slide_r_intro.Rmd
+++ b/slide_r_intro.Rmd
@@ -1,7 +1,7 @@
 ---
 title: "Introduction to R"
-subtitle: "R Foundations for Life Scientists"
-author: "Marcin Kierczak"
+subtitle: "R Foundations for Data Analysis"
+author: "Marcin Kierczak and Nima Rafati"
 keywords: bioinformatics, course, scilifelab, nbis, R
 output:
   xaringan::moon_reader:
@@ -34,7 +34,7 @@ count: false
 #library(tidyr)
 #library(stringr)
 #library(ggplot2)
-library(mkteachr)
+#library(mkteachr)
 ```
 
 ---
@@ -44,10 +44,11 @@ class: spaced
 # Contents
 
 * [About R](#about)
-* [Timeline](#timeline)
-* [Ideas behind R](#ideas)
 * [Pros and cons of R](#pros_and_cons)
 * [Ecosystem of packages](#num_packages)
+* [Programming language](#programming_language)
+* [Packages](#packages)
+* [Package installation](#pkg_cran_inst)
 
 ---
 name: about
@@ -98,123 +99,177 @@ name: about
 
 ---
 name: timeline
-# Timeline
+---
+name: pros_and_cons
+class: spaced
+
+# Pros and cons
+
+ steep learning curve
--
+ 
uniform, clear and clean system of documentation and help
-.pull-left-50[
+--
+ difficulties due to limited object-oriented programming capabilities,
+e.g. an agent-based simulation is a challenge
-![](data/slide_intro/Ihaka_and_Gentleman.jpg)
+--
+ good interconnectivity with compiled languages like Java or C
-* ca. 1992 — conceived by [Robert Gentleman](https://bit.ly/35kn99L) and [Ross Ihaka](https://en.wikipedia.org/wiki/Ross_Ihaka) (R&R) at the University of Auckland, NZ as a tool for **teaching statistics**
+--
+ cannot order a pizza for you (?)
-* 1994 — initial version
-* 2000 — stable version
+--
+ a very powerful ecosystem of packages
-]
+--
+ free and open source, GNU GPL and GNU GPL 2.0
--
+ easy to generate high quality graphics
-.pull-right-50[
+---
+name: programming_language
-![](data/slide_intro/jjallaire_siliconangle_com.jpg)
+# Programming Language
-* 2011 — [RStudio](https://en.wikipedia.org/wiki/RStudio), first release by J.J. Allaire
+--
+> Programming is the process of instructing a computer to perform a specific task. We write these instructions in a **programming language**. The task can be as simple as a calculation (like on a calculator) or as complex as a full application.
-![](data/slide_intro/hadley-wickham.jpg)
+--
-* ca. 2017 — Tidyverse by [Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham)
-]
+ * flow of _data_
+--
---
-name: ideas
+ * Data is collected information which qualitatively and/or quantitatively describes an entity.
+--
-# Ideas behind R
+ * Data is collected from quite diverse sources and in diverse data types.
+--
+
+ * Data processing.
+--
-* open-source solution — fast development
+ * Data cleaning.
--
-* based on the [S language](https://en.wikipedia.org/wiki/S_%28programming_language%29) created at the Bell Labs by [John Mc Kinley Chambers](https://bit.ly/2RhDqUx) to
+```{r,out.width="75%",fig.align='center',echo=FALSE}
+knitr::include_graphics("data/slide_programming/Data_Information_Knowledge.png")
+```
+---
+# Programming Language cted.
-> *turn ideas into software, quickly and faithfully*
+--
+ * from one _function_ to another
--
-* [lexical scope](https://en.wikipedia.org/wiki/Scope_%28computer_science%29%23Lexical_scoping) inspired by [Lisp](https://en.wikipedia.org/wiki/Lisp) syntax
+ * A function is a **reusable** chunk of code that performs a task. It takes **inputs** as well as **arguments** that control how they are processed.
+--
+ * each function does something to the data and returns output(s)
+--
+
+ * For example `mean()`, `min()`
--
+---
+# Three things to think about
-* since 1997 developed by the R Development Core Team (ca. 20 experts, with Chambers onboard; 6 are active)
+ * what *types* of data can I process?
--
-* overviewed by [The R Foundation for Statistical Computing](https://www.r-project.org/foundation/)
+ * how do I *write* what I want?
---
-name: packages
+--
-# Packages
+ * when does it *mean* anything?
-.pull-right-50[
-```{r, out.width="250pt", fig.align='center', echo=FALSE}
-knitr::include_graphics("data/slide_intro/packages.jpg")
+
+---
+# Data type
+
+```{r,out.width="75%",fig.align='center',echo=FALSE}
+knitr::include_graphics("data/slide_programming/Data_classification.png")
```
-]
+---
+
+# Three components of a language
--
-* developed by the community
+ * what *types* of data can I process — *type system*
--
-* cover several very diverse areas of science/life
+ * int — 1 2 5 9
+ * double — 1.23 -5.74
+ * char — a b test 7 9
+ * logical — TRUE/FALSE (T/F)
+
--
-* uniformely structured and documented
+ * how do I *write* what I want — *syntax* defined by a language *grammar*
+
+ `2 * 1 + 1` vs. 
`(+ (* 2 1) 1)`
--
+ * when does it *mean* anything — *semantics*
+
+--
+
+ * *Colorful yellow train sleeps on a crazy wave.* — has no generally accepted meaning
+ * *There is $500 on his empty bank account.* — internal contradiction
---
-name: pros_and_cons
-class: spaced
+name: topic2
-# Pros and cons
+# Where to start?
- steep learning curve
---
- uniform, clear and clean system of documentation and help
+*Divide et impera* — divide and rule.
---
- difficulties due to a limited object-oriented programming capabilities,
-e.g. an agent-based simulation is a challenge
+**Top-down approach:** define the big problem and split it into smaller ones. Assume you have solutions to the smaller problems and continue — push the responsibility down.
+Wishful thinking!
---
- good interconnectivity with compiled languages like Java or C
+---
+
+name: packages
+
+# Packages
+
+.pull-right-50[
+```{r, out.width="250pt", fig.align='center', echo=FALSE}
+knitr::include_graphics("data/slide_intro/packages.jpg")
+```
+]
--
- cannot order a pizza for you (?)
+
+* developed by the community
--
- a very powerful ecosystem of packages
+
+* cover several very diverse areas of science/life
--
- free and open source, GNU GPL and GNU GPL 2.0
+
+* uniformly structured and documented
--
- easy to generate high quality graphics
+
+* organised in repositories:
+ + [CRAN](https://cran.r-project.org)
+ + [R-Forge](https://r-forge.r-project.org)
+ + [Bioconductor](http://www.bioconductor.org)
+ + [GitHub](https://github.com)
---
name: num_packages
-
# Ecosystem of R packages
@@ -232,6 +287,148 @@ gg
+
+---
+name: work_with_packages
+
+# Working with packages
+
+Packages are organised in repositories. The three main repositories are:
+
+* [CRAN](https://cran.r-project.org)
+* [R-Forge](http://r-forge.r-project.org)
+* [Bioconductor](http://www.bioconductor.org)
+
+We also have [GitHub](https://github.com).
+
+--
+# Working with packages -- CRAN example.
+
+```{r,out.width="80%",fig.align='center',echo=FALSE}
+knitr::include_graphics("data/slide_r_environment/ggplot2_CRAN.png")
+```
+
+---
+name: pkg_cran_inst
+
+# Working with packages -- installation
+
+Only a few packages are pre-installed:
+
+```{r pkg.err.ex,eval=TRUE,error=TRUE}
+library(XLConnect)
+```
+
+To install a package from the command line, use:
+
+```{r pkg.inst,eval=FALSE}
+install.packages("ggplot2",dependencies=TRUE)
+```
+
+---
+name: work_pkg_details
+
+# Working with packages -- details
+
+It may happen that you also want to specify the repository, e.g. because it is geographically closer to you or because your default mirror is down:
+
+```{r pkg.inst.repo,eval=FALSE}
+install.packages('ggplot2',dependencies=TRUE,repos="http://cran.se.r-project.org")
+```
+
+Sometimes even this does not work, because the package is not available as a binary for your platform. In such a case, you need to *compile* it from its *source code*.
+
+---
+name: work_pkg_details2
+
+# Working with packages -- details cted.
+```{r,out.width="150%",fig.align='center',echo=FALSE}
+knitr::include_graphics("data/slide_r_environment/ggplot2_CRAN.png")
+```
+
+---
+name: source_pkg_inst
+
+# Working with packages -- installing from source.
+
+- Download the source file, in our example *ggplot2_3.4.3.tar.gz*. 
+- Install it:
+
+```{r pkg.inst.src,eval=FALSE}
+install.packages("path/to/ggplot2_3.4.3.tar.gz",
+                 repos=NULL,
+                 type='source',
+                 dependencies=TRUE)
+```
+
+- Load it:
+
+```{r pkg.load,eval=FALSE}
+library('ggplot2') # stops with an error if the package is missing
+require('ggplot2') # returns FALSE with a warning instead of an error
+```
+
+- Enjoy!
+
+---
+name: pkg_github
+
+# Packages -- GitHub
+
+Nowadays, more and more developers distribute their packages via GitHub. The easiest way to install packages from GitHub is via the *devtools* package:
+
+- Install the *devtools* package.
+- Load it.
+- Install.
+- Enjoy!
+
+```{r pkg.inst.devtools.github,eval=FALSE}
+install.packages('devtools',dependencies=TRUE)
+library('devtools')
+install_github('talgalili/installr')
+```
+
+---
+name: pkg_bioconductor
+
+# Packages -- Bioconductor
+
+```{r,out.width="200pt",fig.align='center',echo=FALSE}
+knitr::include_graphics("data/slide_r_environment/logo_bioconductor.png")
+```
+
+First, install the *BiocManager* package:
+
+```{r inst.biocond,eval=FALSE}
+if (!requireNamespace("BiocManager",quietly = TRUE))
+    install.packages("BiocManager")
+```
+
+---
+name: pkg_bioconductor2
+
+# Packages -- Bioconductor cted.
+
+Now, you can install particular packages from Bioconductor:
+
+```{r biocond.inst.pkg,eval=FALSE}
+BiocManager::install("GenomicRanges")
+```
+
+For more info, visit the [Bioconductor website](http://www.bioconductor.org/install/).
+
+---
+# One package to rule them all -- the magic of `renv`
+
+- the first time: run `renv::activate()` and `renv::init()`
+- while working: `renv::hydrate()` and `renv::snapshot()`
+
+Now, send `renv.lock` to your friend to share the environment and she can:
+
+- restore the environment with `renv::restore()`
+
+**Pure magic!**
+
 ---
 
 
@@ -240,8 +437,7 @@ class: end-slide, middle
 count: false
 
 # Thank you! Questions?
-
-```{r,echo=FALSE,child="assets/footer-slide.Rmd"}
+```{r, echo=FALSE,child="assets/footer-slide.Rmd"}
 ```
 
 ```{r,include=FALSE,eval=FALSE}