---
title: "Reference database preparation"
bibliography: '`r sharedbib::bib_path()`'
output:
  html_document:
    css: style.css
---

```{r setup, include=FALSE}
source('style.R')
```

`dada2` requires a special format for assigning taxonomy using a reference database. 
Also, we are using two different loci and amplifying both fungi and oomycetes, so we will need a reference database for each locus.
The databases should have a format like:

```
>Kingdom;Phylum;Class;Order;Family;Genus;Species;Reference
ACGAATGTGAAGTAA......
```

Here, "Reference" is a database-specific ID for each sequence, since there can be multiple sequences for a single species.
I will start the taxonomy at the Domain Eukaryota, since it is shared by oomycetes and fungi.
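
As a quick illustration of this layout, the sketch below splits one header into its fields with base R. The header is a made-up example that includes the leading Domain field and a trailing `;`; the `oodb_1` ID and the exact classification are assumptions for illustration, not a record copied from either database.

```r
# Hypothetical header in the target format (not a real database record)
example_header <- paste0(
  ">Eukaryota;Heterokontophyta;Stramenopiles;Oomycetes;Peronosporales;",
  "Peronosporaceae;Phytophthora;Phytophthora_infestans;oodb_1;"
)

# Drop the leading ">" and split on ";".
# strsplit() does not return an empty field for the trailing ";".
example_fields <- strsplit(sub("^>", "", example_header), ";", fixed = TRUE)[[1]]

length(example_fields)   # number of fields, including the reference ID
example_fields[7:9]      # genus, species, and reference ID
```

The last three fields are always genus, species, and reference ID, which is what the format checks later in this document rely on.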

## Preparation 

### Parameters

```{r}
seed <- 1
set.seed(seed)
```


### Packages used

```{r message=FALSE}
library(dada2)
library(metacoder)
library(purrr)
library(readr)
library(stringr)
library(tibble)
library(taxize)
library(dplyr)
library(DT)
library(ape)
```

### Output location

Any data generated by this analysis that is used in other analyses is stored in `intermediate_data`.
The formatted reference database FASTA files will be stored here.

```{r}
formatted_ref_dir <- file.path("intermediate_data", "reference_databases")
if (!dir.exists(formatted_ref_dir)) {
  dir.create(formatted_ref_dir, recursive = TRUE)
}
```


## Rps10 database

### Make FASTA headers

The rps10 database described in the associated publication will be used to assign taxonomy to ASVs/OTUs for the rps10 MiSeq sequences.
Any unidentified sequences will be compared against GenBank with BLAST when analyzing non-target amplification, so no non-target sequences will be added here.
The sequences for species used in the mock community are already included in this database.

```{r}
rps10_db <- read_fasta(file.path("raw_data", "reference_databases", "2020-11-3_release_2_rps10.fasta"))
rps10_data <- str_match(names(rps10_db), pattern = "name=(.+)\\|strain=(.+)\\|ncbi_acc=(.+)\\|ncbi_taxid=(.+)\\|oodb_id=(.+)\\|taxonomy=(.+)$")
colnames(rps10_data) <- c("header", "name", "strain", "ncbi_acc", "ncbi_taxid", "oodb_id", "taxonomy")
rps10_data <- as_tibble(rps10_data)
datatable(rps10_data)
```

Let's remove the "cellular_organisms" root:

```{r}
rps10_data$taxonomy <- gsub(rps10_data$taxonomy, pattern = 'cellular_organisms;', replacement = '', fixed = TRUE)
head(rps10_data$taxonomy)
```

and add a "Heterokontophyta" rank, so it has the same number of ranks as the ITS database (see below):

```{r}
rps10_data$taxonomy <- gsub(rps10_data$taxonomy, pattern = 'Eukaryota;', replacement = 'Eukaryota;Heterokontophyta;', fixed = TRUE)
head(rps10_data$taxonomy)
```

This taxonomy has the genus and species joined together as a single level.
I will split them into their own levels:

```{r}
binomial <- map_chr(str_split(rps10_data$taxonomy, pattern = ';'), `[`, 7)
genus <- map_chr(str_split(binomial, pattern = '_'), `[`, 1)
unique(genus)
rps10_data$taxonomy <- map_chr(seq_along(rps10_data$taxonomy), function(index) {
  # fixed = TRUE so that characters such as "." or "(" in a name are matched literally
  sub(rps10_data$taxonomy[index], pattern = binomial[index], replacement = paste0(genus[index], ';', binomial[index]), fixed = TRUE)
})
head(rps10_data$taxonomy)
```
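
To make the intended result concrete, here is a minimal base R sketch of the same idea on two toy classifications. The strings are shortened, made-up examples, and this version rebuilds each classification by position rather than by pattern replacement, so it is an alternative illustration rather than the code used above.

```r
# Toy classifications with genus and species joined in the last field
toy_taxonomy <- c("Eukaryota;Oomycetes;Phytophthora_infestans",
                  "Eukaryota;Oomycetes;Pythium_ultimum")

toy_fields   <- strsplit(toy_taxonomy, ";", fixed = TRUE)
toy_binomial <- vapply(toy_fields, function(x) x[length(x)], character(1))
toy_genus    <- sub("_.*$", "", toy_binomial)  # text before the first "_"

# Rebuild each classification with the genus inserted before the binomial
toy_split <- mapply(function(fields, genus) {
  paste(c(fields[-length(fields)], genus, fields[length(fields)]), collapse = ";")
}, toy_fields, toy_genus, USE.NAMES = FALSE)

toy_split
```

Each classification gains exactly one field, which is why the number of `;` separators is checked once all the formatting steps are done.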

I will also add a "rank" for the name of the database and reference sequence, using the index as ID:

```{r}
rps10_data$taxonomy <- paste0(rps10_data$taxonomy, ';', 'oodb_', seq_along(rps10_data$taxonomy))
head(rps10_data$taxonomy)
```

Finally, add a trailing `;` to conform to the dada2 examples:

```{r}
rps10_data$taxonomy <- paste0(rps10_data$taxonomy, ';')
head(rps10_data$taxonomy)
```

and remove any white space:

```{r}
rps10_data$taxonomy <- trimws(rps10_data$taxonomy)
rps10_db <- trimws(rps10_db)
```

Let's check that the taxonomy is now formatted as expected:

```{r}
stopifnot(all(str_count(rps10_data$taxonomy, pattern = ";") == 9))
```


We can then associate these classifications with the sequences they were derived from and save the result as a new FASTA file:

```{r}
rps10_ref_path <- file.path(formatted_ref_dir, "rps10_reference_db.fa")
paste0(">", rps10_data$taxonomy, "\n", rps10_db) %>%
  write_lines(file = rps10_ref_path)
```
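
The write step can be checked with a small round trip. The sketch below uses base R only, a temporary file, and two invented records; it assumes each sequence sits on a single line, as in the file written above.

```r
# Two invented records in the reference format
toy_headers <- c("Eukaryota;Oomycetes;Phytophthora;Phytophthora_infestans;oodb_1;",
                 "Eukaryota;Oomycetes;Pythium;Pythium_ultimum;oodb_2;")
toy_seqs <- c("ACGAATGTGAAGTAA", "ACGTTTGTGCAGTAA")

# Write header and sequence as two lines per record
toy_path <- tempfile(fileext = ".fa")
writeLines(paste0(">", toy_headers, "\n", toy_seqs), toy_path)

# Read the file back and separate header lines from sequence lines
toy_lines <- readLines(toy_path)
is_header <- startsWith(toy_lines, ">")
read_headers <- sub("^>", "", toy_lines[is_header])
read_seqs    <- toy_lines[!is_header]
```

If the headers and sequences read back are identical to the inputs, the file has one header line followed by one sequence line per record.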

Let's make a table of the number of sequences in each genus:

```{r}
genus_count <- table(map_chr(strsplit(rps10_data$name, split = '_'), `[`, 1))
count_table <- as.data.frame(genus_count, stringsAsFactors = FALSE)
count_table <- as_tibble(count_table)
names(count_table) <- c('Genus', 'Number of sequences')
count_table
write_csv(count_table, file = file.path("results", "rps10_genus_count_table.csv"))
```


## ITS1 database

I will use a database which is a combination of the sequences from @robideau2011dna, PhytophthoraDB, and UNITE.
I will also add some sequences for the mock community that we sequenced to make sure they are included.

### Robideau 2011

These are sequences from a study that did a phylogeny of all oomycetes using ITS and COX.  

```{r}
rob_2011_seqs <- read_fasta(file.path('raw_data', 'reference_databases', 'robideau_2011_its_database.fa'))
rob_2011 <- extract_tax_data(names(rob_2011_seqs), regex = "(.*)", key = "class", class_sep = "|") %>%
  filter_taxa(!is_leaf) %>%
  filter_taxa(! grepl(taxon_names, pattern = '^unclassified_'))
stopifnot(all(rob_2011$data$tax_data$input == names(rob_2011_seqs)))
rob_2011$data$seqs <- setNames(rob_2011_seqs, rob_2011$data$tax_data$taxon_id)
rob_2011$data <- rob_2011$data["seqs"]
rob_2011
```


### UNITE

UNITE contains fungal species and some oomycetes.

```{r}
unite <- parse_unite_general(file = file.path('raw_data', 'reference_databases', 'sh_general_release_dynamic_all_02.02.2019.fasta'))
unite
```


### Phytophthora DB 

Phytophthora DB contains *Phytophthora* sequences.

```{r}
phyto_db_seqs <- read_fasta(file.path('raw_data', 'reference_databases', 'phytophthora_db_its_database.fa'))
base_class <- lookup_tax_data('Phytophthora', type = 'taxon_name') %>% classifications() %>% tail(1)
phyto_db_class <- paste0(base_class, ';', str_match(names(phyto_db_seqs), '.+ \\((.+)\\)')[, 2])
phyto_db <- extract_tax_data(phyto_db_class, regex = "(.*)", key = "class", class_sep = ";") %>%
  filter_taxa(taxon_names != 'cellular organisms')
phyto_db$data$seqs <- setNames(phyto_db_seqs, phyto_db$data$tax_data$taxon_id)
phyto_db$data <- phyto_db$data["seqs"]
phyto_db
```

### Mock community sequences 

Not all the species used in the mock community have ITS1 sequences in the databases used so far. Here are counts of the number of sequences for each mock community member:

```{r}
mc_data <- read_csv(file.path('raw_data', 'mock_community.csv'))
# mc_data$species <- gsub(mc_data$species, pattern = ' ', replacement = '_')
vapply(gsub(mc_data$species, pattern = ' ', replacement = '_'), FUN.VALUE = numeric(1), function(s) {
  sum(agrepl(c(names(rob_2011_seqs), names(phyto_db_seqs), unite$data$tax_data$organism), pattern = s))
})
```
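
The counts above use approximate matching (`agrepl`), which tolerates small spelling differences. A minimal sketch of the difference from exact matching, using invented header names and the default `max.distance`:

```r
# Invented names: one exact, one misspelled (missing an "h"), one unrelated
toy_names <- c("Phytophthora_infestans", "Phytophtora_infestans", "Pythium_ultimum")
query <- "Phytophthora_infestans"

exact_hits <- grepl(query, toy_names, fixed = TRUE)  # literal substring match
fuzzy_hits <- agrepl(query, toy_names)               # approximate match

data.frame(name = toy_names, exact = exact_hits, fuzzy = fuzzy_hits)
```

Approximate matching picks up the misspelled entry but can also over-count closely named species, so these counts are best read as an upper bound.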

I therefore included Sanger sequences of the mock community members produced for this study.

```{r}
its_mc_seqs <- read_fasta(file.path('raw_data', 'reference_databases', 'mock_comm_its1_sanger.fasta'))
names(its_mc_seqs)
```

This file does not have their full taxonomy, so I will use the taxonomy from the rps10 database for the same species:

```{r}
names(its_mc_seqs) <- str_match(names(its_mc_seqs), pattern = '([A-Za-z]+ [A-Za-z]+) ?.*$')[, 2] # [A-z] would also match "[", "\", "]", "^", "_" and "`"
names(its_mc_seqs) <- gsub(names(its_mc_seqs), pattern = ' ', replacement = '_')
names(its_mc_seqs) <- map_chr(names(its_mc_seqs), function(n) {
  tax <- rps10_data$taxonomy[grep(rps10_data$name, pattern = n)]
  str_match(tax, pattern = paste0('^(.+', n, ').+$'))[1, 2] # match up until species name
})
head(names(its_mc_seqs))
```


### Make header taxonomy ranks consistent

The Phytophthora DB and Robideau sequences have a similar taxonomy:

```{r}
rob_2011_seqs <- setNames(rob_2011$data$seqs, classifications(rob_2011)[names(rob_2011$data$seqs)])
head(names(rob_2011_seqs))
phyto_db_seqs <- setNames(phyto_db$data$seqs, classifications(phyto_db)[names(phyto_db$data$seqs)])
head(names(phyto_db_seqs))
```

But UNITE is different:

```{r}
unite_seqs <- setNames(unite$data$tax_data$unite_seq, classifications(unite)[unite$data$tax_data$taxon_id])
head(names(unite_seqs))
```

The Phytophthora DB and Robideau sequences have these ranks:

> Domain, Kingdom, Class, Order, Family, Genus, Species

But in UNITE there is no Domain, and there is also a Phylum:

> Kingdom, Phylum, Class, Order, Family, Genus, Species

In UNITE, Stramenopiles is called Stramenopila and is considered a Kingdom.

I can make the two the same, by adding the Eukaryota Domain to Unite and the Phylum Heterokontophyta to the other two, making the ranks:

> Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species


```{r}
names(unite_seqs) <- paste0('Eukaryota;', names(unite_seqs))
names(rob_2011_seqs) <- sub(names(rob_2011_seqs), pattern = ';Stramenopiles;',
                            replacement = ';Heterokontophyta;Stramenopiles;', fixed = TRUE)
names(phyto_db_seqs) <- sub(names(phyto_db_seqs), pattern = ';Stramenopiles;',
                            replacement = ';Heterokontophyta;Stramenopiles;', fixed = TRUE)
names(phyto_db_seqs) <- sub(names(phyto_db_seqs), pattern = ';Sar;',
                            replacement = ';', fixed = TRUE)
```
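
The effect of these substitutions can be seen on two toy classifications. The strings below are invented stand-ins for a UNITE header and a Robideau or Phytophthora DB header; after the edits both have eight ranks.

```r
# Invented stand-ins for one UNITE and one oomycete-database classification
toy_unite <- "Fungi;Ascomycota;Sordariomycetes;Hypocreales;Nectriaceae;Fusarium;Fusarium_oxysporum"
toy_oomycete <- "Eukaryota;Stramenopiles;Oomycetes;Peronosporales;Peronosporaceae;Phytophthora;Phytophthora_infestans"

# Add the missing Domain to the UNITE-style classification
toy_unite <- paste0("Eukaryota;", toy_unite)

# Add the missing rank to the other classification (literal match, first hit only)
toy_oomycete <- sub(";Stramenopiles;", ";Heterokontophyta;Stramenopiles;",
                    toy_oomycete, fixed = TRUE)

# Both should now have the same number of ranks
toy_rank_counts <- lengths(strsplit(c(toy_unite, toy_oomycete), ";", fixed = TRUE))
toy_rank_counts
```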

The taxonomy of the sequences from our lab has already been put into this format.

The taxonomy now has the same number of ranks, but there are some differences in the classification of oomycetes in UNITE vs the other databases:

```{r}
head(names(unite_seqs)[grepl(names(unite_seqs), pattern = 'Oomy')])
head(names(rob_2011_seqs))
head(names(phyto_db_seqs))
head(names(its_mc_seqs))
```

So I will make them as consistent as possible:

```{r}
names(unite_seqs) <- sub(names(unite_seqs), fixed = TRUE,
                         pattern = 'Stramenopila;Oomycota',
                         replacement = 'Heterokontophyta;Stramenopiles')
names(phyto_db_seqs) <- sub(names(phyto_db_seqs), fixed = TRUE,
                         pattern = 'Oomycota',
                         replacement = 'Oomycetes')
```

Now they should be the same: 

```{r}
head(names(unite_seqs)[grepl(names(unite_seqs), pattern = 'Oomy')])
head(names(rob_2011_seqs))
head(names(phyto_db_seqs))
head(names(its_mc_seqs))
head(rps10_data$taxonomy)
```


### Combine sequences

I will add a "rank" for the reference too, with the name of the database the sequence came from and the source database index:

```{r}
names(unite_seqs) <- paste0(names(unite_seqs), ';', 'unite_', seq_along(unite_seqs))
names(rob_2011_seqs) <- paste0(names(rob_2011_seqs), ';', 'rob2011_', seq_along(rob_2011_seqs))
names(phyto_db_seqs) <- paste0(names(phyto_db_seqs), ';', 'phytodb_', seq_along(phyto_db_seqs))
names(its_mc_seqs) <- paste0(names(its_mc_seqs), ';', 'mock_', seq_along(its_mc_seqs))
its1_seqs <- c(unite_seqs, rob_2011_seqs, phyto_db_seqs, its_mc_seqs)
```

Note that the ID is an index of the sequence in the particular version of the database used.
Since none of the sequences were filtered out before this was assigned, this can be used to identify the particular sequence in a particular database.

### Remove unidentified sequences

At least one of the databases (UNITE, I think) has unidentified sequences.
I will remove these so that they are not used when assigning taxonomy to ASVs, where they could displace a slightly worse match to a more informative sequence:

```{r}
its1_seqs <- its1_seqs[! grepl(names(its1_seqs), pattern = 'unidentified')]
```


### Clean up formatting

Add combined index:

```{r}
names(its1_seqs) <- paste0(names(its1_seqs), '_', seq_along(its1_seqs))
head(names(its1_seqs))
```

This is the index of the sequence in the combined reference database.
No filtering should be done to the combined database after this.

Replace spaces with underscores:

```{r}
names(its1_seqs) <- gsub(names(its1_seqs), pattern = " ", replacement = "_")
```

Make sure they end in a `;`, like the example databases for dada2:

```{r}
no_ending_semicolon <- ! endsWith(names(its1_seqs), ';')
names(its1_seqs)[no_ending_semicolon] <- paste0(names(its1_seqs)[no_ending_semicolon], ";")
```

Make sure all sequences are uppercase for consistency:

```{r}
its1_seqs <- toupper(its1_seqs)
```

and that there is no extra whitespace in the headers or sequences:

```{r}
its1_seqs <- trimws(its1_seqs)
names(its1_seqs) <- trimws(names(its1_seqs))
```
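
As a compact illustration of the three clean-up steps above, the sketch below applies them to a toy named vector (names are headers, values are sequences). The headers and sequences are invented.

```r
# Toy named vector: names are headers, values are sequences
toy_db <- c("Eukaryota;Fungi;unite_1_1"      = " acgtacgt ",
            "Eukaryota;Oomycetes;mock_1_2;"  = "ACGTTTGA")

# Append ";" only where it is missing
needs_semicolon <- !endsWith(names(toy_db), ";")
names(toy_db)[needs_semicolon] <- paste0(names(toy_db)[needs_semicolon], ";")

# Uppercase and trim the sequences; both functions keep the names attribute
toy_db <- trimws(toupper(toy_db))
toy_db
```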

I will check for any sequences with ambiguity codes:

```{r}
sum(! grepl(its1_seqs, pattern = '^[AGCTagct]*$'))
```
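
For reference, this is how the pattern behaves on a few toy sequences: anything other than A, C, G, or T (in either case) counts as an ambiguity code. The sequences are invented.

```r
toy_check <- c(clean = "ACGTACGT", ambiguous = "ACGTNRYT", lower = "acgtacgt")

# TRUE where the sequence contains any character other than A, C, G, or T
toy_has_ambiguity <- !grepl("^[AGCTagct]*$", toy_check)
toy_has_ambiguity
```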

Check for problems:

```{r}
stopifnot(all(str_count(names(its1_seqs), pattern = ";") == 9))
stopifnot(all(nchar(its1_seqs) > 100))
stopifnot(! any(grepl(names(its1_seqs), pattern = 'NA')))
```


And finally save the results in a FASTA file:

```{r}
its_ref_path <- file.path(formatted_ref_dir, "its1_reference_db.fa")
paste0(">", names(its1_seqs), "\n", its1_seqs) %>%
  write_lines(its_ref_path)
```


## Mock community sequences

### Look up synonyms for mock community

Many of the mock community members have synonyms and, since multiple databases are used, multiple names might be used for the same taxon.
Therefore, I will look up the synonyms of taxa used in the mock community in the Catalogue of Life (COL).
The code below used to work, but the current version of `taxize` [has stopped supporting COL](https://github.com/ropensci/taxize/issues/796):

```{r eval=FALSE}
mc_data <- read_csv(file.path('raw_data', 'mock_community.csv'))
mc_syn_data <- as_tibble(synonyms_df(synonyms(mc_data$species, db = "col")))
mc_syn_data <- rename(mc_syn_data,  mc_name = .id, col_id = id, syn_name = name)
write_csv(mc_syn_data, file = file.path('intermediate_data', 'mock_comm_synonyms.csv'))
```

So I saved the result of that code from when it did work and load it here:

```{r}
mc_syn_data <- read_csv(file.path('raw_data', 'mock_comm_synonyms.csv'))
datatable(mc_syn_data)
```


### Check that mock community sequences are in reference databases

In order to fairly evaluate the performance of each primer pair / locus, we need to know if the mock community sequences are actually present in the reference databases.

```{r}
its_seqs <- read_fasta(its_ref_path)
rps10_seqs <- read_fasta(rps10_ref_path)
```

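Each species name and its synonyms are matched against the database headers with a case-insensitive `grepl`.
Unlike the `agrepl` counts above, this does not tolerate misspellings.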

```{r}
is_in_db <- function(species, db, allow_ambig = TRUE) {
  names(db) <- gsub(names(db), pattern = '_', replacement = ' ')
  mc_syn_data <- readr::read_csv(file.path('raw_data', 'mock_comm_synonyms.csv'), col_types = cols())
  out <- purrr::map_lgl(species, function(s) {
    sp_names <- unique(c(s, mc_syn_data$syn_name[mc_syn_data$mc_name == s]))
    sp_names <- gsub(sp_names, pattern = '_', replacement = ' ')
    if (allow_ambig) {
      out <- any(vapply(sp_names, FUN.VALUE = logical(1), function(name) {
        any(grepl(names(db), pattern = name, ignore.case = TRUE))
      }))
    } else {
      out <- any(vapply(sp_names, FUN.VALUE = logical(1), function(name) {
        any(grepl(names(db), pattern = name, ignore.case = TRUE) & grepl(db, pattern = '^[AGCTagct]*$'))
      }))
    }
    return(out)
  })
  names(out) <- species
  out
}

mc_data$in_its_db <- is_in_db(mc_data$species, its_seqs)
mc_data$in_rps10_db <- is_in_db(mc_data$species, rps10_seqs)
```
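
The synonym-aware lookup can be illustrated on toy data. The sketch below is a simplified base R stand-in for `is_in_db`: `toy_is_in_db`, the synonym table, and the headers are all hypothetical, and it takes the synonym table as an argument instead of reading it from disk.

```r
# Hypothetical synonym table with the same columns as mock_comm_synonyms.csv
toy_synonyms <- data.frame(mc_name  = "Globisporangium ultimum",
                           syn_name = "Pythium ultimum",
                           stringsAsFactors = FALSE)

# Hypothetical reference headers
toy_db_headers <- c("Eukaryota;Oomycetes;Pythium;Pythium_ultimum;toy_1;",
                    "Eukaryota;Oomycetes;Phytophthora;Phytophthora_infestans;toy_2;")

toy_is_in_db <- function(species, headers, synonyms) {
  headers <- gsub("_", " ", headers, fixed = TRUE)
  vapply(species, function(s) {
    # A species counts as present if its name or any synonym occurs in a header
    candidates <- unique(c(s, synonyms$syn_name[synonyms$mc_name == s]))
    any(vapply(candidates, function(nm) {
      any(grepl(nm, headers, ignore.case = TRUE))
    }, logical(1)))
  }, logical(1))
}

toy_found <- toy_is_in_db(
  c("Globisporangium ultimum", "Phytophthora infestans", "Phytophthora ramorum"),
  toy_db_headers, toy_synonyms
)
toy_found
```

The first species is found only through its synonym, the second directly, and the third not at all, which is the distinction the `in_its_db` and `in_rps10_db` columns record.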

Some of the methods used to assign taxonomy cannot use sequences with ambiguity codes, so I will check for sequences with no ambiguity codes.

```{r}
mc_data$in_its_db_no_ambig <- is_in_db(mc_data$species, its_seqs, allow_ambig = FALSE)
mc_data$in_rps10_db_no_ambig <- is_in_db(mc_data$species, rps10_seqs, allow_ambig = FALSE)
```

I will save a version of the mock community data that includes these results for use in other analyses:

```{r}
write_csv(mc_data, file = file.path('intermediate_data', 'mock_community.csv'))
datatable(mc_data)
```

All the mock community species have sequences for both rps10 and ITS1 in the reference database:

```{r}
stopifnot(all(mc_data$in_its_db & mc_data$in_rps10_db))
```


### Make phylogeny of mock community sequences

This might be useful for seeing which species cannot be distinguished.
Note that this is the phylogeny of the whole reference sequence, not just the amplicon.

```{r fig.width=10, fig.height=10}
nj_tree <- function(seqs, ...) {
  # Align sequences:
  aligned <- seqs %>%
    insect::char2dna() %>%
    ips::mafft(method = 'localpair', exec = '/usr/bin/mafft')
  
  # Make distance matrix
  dist <- ape::dist.dna(aligned, ...)
  
  # Make tree
  tree <- ape::nj(dist)
  tree <- ape::ladderize(tree)
  tree <- phangorn::midpoint(tree)
  
  tree
}

is_in_mc <- function(db) {
  mc_sp <- unique(c(mc_data$species, mc_syn_data$syn_name))
  names(db) <- gsub(names(db), pattern = '_', replacement = ' ')
  grepl(names(db), pattern = paste0(mc_sp, collapse = '|'))
}

plot_mc_tree <- function(db) {
  seqs <- db[is_in_mc(db)]
  db_names <- str_match(names(seqs), pattern = '.+;(.+);.+;$')[, 2]
  db_names <- gsub(db_names, pattern = '_', replacement = ' ')
  mc_names <- mc_syn_data$mc_name[match(db_names, mc_syn_data$syn_name)]
  names(seqs) <- db_names
  has_syn <- ! is.na(mc_names) & mc_names != db_names
  names(seqs)[has_syn] <- paste0(names(seqs[has_syn]), ' (', mc_names[has_syn], ')')
  seqs <- seqs[! duplicated(names(seqs))]
  tree <- nj_tree(seqs, model = 'N')
  plot.phylo(tree)
  axis(side = 1)
  title(xlab = 'Number of differing sites')
}

plot_mc_tree(rps10_seqs)
plot_mc_tree(its_seqs)
```
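
The trees above are built from distances computed with `model = 'N'`, which in `ape::dist.dna` is the number of sites that differ between each pair of sequences. A minimal base R sketch of that distance on three invented, already aligned, gap-free sequences:

```r
# Invented aligned sequences of equal length (no gaps)
toy_aligned <- c(a = "ACGTACGT", b = "ACGTACGA", c = "TCGTACGA")
toy_chars <- strsplit(toy_aligned, "")

# Pairwise count of differing sites
n <- length(toy_chars)
toy_dist <- matrix(0L, n, n, dimnames = list(names(toy_aligned), names(toy_aligned)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    toy_dist[i, j] <- sum(toy_chars[[i]] != toy_chars[[j]])
  }
}
toy_dist
```

A pair of species with a distance of zero over the whole reference sequence cannot be told apart with that locus, whatever the amplicon.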

## Software used

```{r}
sessioninfo::session_info()
```


## References