abbagadabba

R package abbagadabba (amplicon-based biodiversity assessment, gap analysis, database building and more) provides functions to download DNA sequence data from NCBI based on species names.

Installation

abbagadabba is in active development and only exists on GitHub. It can be installed using devtools (which is a stable package available on CRAN).

library(devtools)
install_github("Maine-eDNA/abbagadabba")

Example usage

Suppose we have a list of species for which we want to retrieve sequence data. This list might include “bad names” (e.g. synonyms or misspellings). We first want to correct as many of those bad names as we can. At the same time, we’ll compile the taxonomic hierarchy of those species.

library(abbagadabba)

cleanNames <- getNCBITaxonomy(c('Idiomyia sproati', 'Drosophil murphy', 'no body'))
cleanNames
#   kingdom     phylum   class   order   suborder  infraorder superfamily
# 1 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 2 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 3    <NA>       <NA>    <NA>    <NA>       <NA>        <NA>        <NA>
#          family     subfamily        tribe      genus         old_name
# 1 Drosophilidae Drosophilinae Drosophilini Drosophila Drosophil murphy
# 2 Drosophilidae Drosophilinae Drosophilini Drosophila Idiomyia sproati
# 3          <NA>          <NA>         <NA>       <NA>          no body
#            ncbi_name   uid
# 1 Drosophila murphyi 48335
# 2 Drosophila sproati  7289
# 3               <NA>  <NA>

We were able to fix Drosophil murphy, making it Drosophila murphyi, and Idiomyia sproati, making it Drosophila sproati, but we couldn’t match no body.
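
To see at a glance which names were changed and which could not be resolved, the rows can be split on whether ncbi_name is missing. This is a minimal sketch only: cleanNamesDemo is a hand-built stand-in that copies a few columns from the output above so the example runs offline, and is not fetched from NCBI.

```r
# Stand-in for the data frame returned by getNCBITaxonomy(); only the
# columns needed here are reproduced, with values copied from the output above.
cleanNamesDemo <- data.frame(
    old_name  = c('Drosophil murphy', 'Idiomyia sproati', 'no body'),
    ncbi_name = c('Drosophila murphyi', 'Drosophila sproati', NA),
    uid       = c('48335', '7289', NA),
    stringsAsFactors = FALSE
)

# Input names that could not be matched to an NCBI record
unmatched <- cleanNamesDemo$old_name[is.na(cleanNamesDemo$ncbi_name)]
unmatched

# Input names that were matched but differ from the NCBI name
matched <- cleanNamesDemo[!is.na(cleanNamesDemo$ncbi_name), ]
changed <- matched[matched$old_name != matched$ncbi_name,
                   c('old_name', 'ncbi_name')]
changed
```

The same two subsetting steps should work on the real cleanNames object, assuming it keeps the old_name and ncbi_name columns shown above.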

Now we need to get all sequence identifiers associated with those species:

goodNames <- cleanNames$ncbi_name[!is.na(cleanNames$ncbi_name)]
seqIDs <- getNCBISeqID(goodNames)
head(seqIDs)
# [1] "2053656522" "2053655721" "1679378317" "1679378287" "1679378281"
# [6] "1679378246"

Finally, we can feed those IDs into the function that retrieves the sequences themselves. For the purpose of this example, we’ll just look at the first two sequences.

seqData <- getGenBankSeqs(seqIDs[1:2])
seqData
# $data
#           accession            species        date pubmed
# 1 JAEIFY000000000.1 Drosophila sproati 16-JUN-2021     NA
# 2 JAEIFX000000000.1 Drosophila murphyi 16-JUN-2021     NA
#                      pubDOI region product organelle region_note
# 1 10.1101/2020.12.14.422775     NA      NA        NA          NA
# 2 10.1101/2020.12.14.422775     NA      NA        NA          NA
#                     latlon                            locality coll_date
# 1 19.574513 N 155.216191 W USA: Waiakea Forest Reserve, Hawaii  Jun-2019
# 2 19.911621 N 155.313161 W                         USA: Hawaii  Apr-2018
#     coll_by specimen_id
# 1 Don Price          NA
# 2 Don Price          NA
# 
# $dna
# [1] ">JAEIFY000000000.1\n" ">JAEIFX000000000.1\n"
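
The dna element appears to hold FASTA-formatted strings, one per record, so it can be written straight to a file. The following is a minimal sketch under that assumption: dnaDemo is a stand-in copied from the output above, and the temporary file path is only for illustration.

```r
# Stand-in for seqData$dna, copied from the output above
dnaDemo <- c(">JAEIFY000000000.1\n", ">JAEIFX000000000.1\n")

# Write the records to a FASTA file; sep = "" because each element
# already carries its own trailing newline
fastaFile <- tempfile(fileext = ".fasta")
cat(dnaDemo, file = fastaFile, sep = "")

# Read the file back to check that each record starts on its own line
fastaLines <- readLines(fastaFile)
fastaLines
```

With real output, replacing dnaDemo with seqData$dna and fastaFile with a path of your choosing should give a file that other sequence tools can read.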

Example usage with locationToSequence.R

To find sequence data for species based on a state or kingdom, run locationToSequence.R. The script requires the rgbif package, which can be installed from the terminal by starting R and calling install.packages:

R
install.packages('rgbif')

When prompted, select a CRAN mirror and the package should install.

To modify the search parameters, edit line 5 of the script, which calls the occ_data function from the rgbif package. This is where the query is specified.

full_data <- rgbif::occ_data(country = 'US', limit = 1, stateProvince = 'Maine', scientificName = '')
# Any parameter can be modified; leave it empty or delete it if you do not want to filter on it.

The script then returns sequence data for each name found. The number of sequences retrieved per name is set on line 26 and can be changed there.

seq_data <- getGenBankSeqs(seq_ids[1])
# Change to seq_data <- getGenBankSeqs(seq_ids[1:5]) to retrieve multiple sequences
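
Between the GBIF query and the sequence retrieval, the occurrence records have to be reduced to a set of species names. The script's own code is not reproduced here; the following is only a sketch of that step, assuming the occ_data result carries a data element with a species column. occDemo is a hand-built stand-in with illustrative species names, and the abbagadabba calls are left commented out because they need network access.

```r
# Stand-in for the list returned by rgbif::occ_data(); the species values
# are illustrative only, not real query results
occDemo <- list(
    data = data.frame(
        species = c('Homarus americanus', 'Homarus americanus', NA, 'Salmo salar'),
        stateProvince = 'Maine',
        stringsAsFactors = FALSE
    )
)

# Keep each species name once and drop records lacking a species-level name
spNames <- unique(occDemo$data$species)
spNames <- spNames[!is.na(spNames)]
spNames

# The names could then go through the same steps as in the first example:
# cleanNames <- getNCBITaxonomy(spNames)
# goodNames  <- cleanNames$ncbi_name[!is.na(cleanNames$ncbi_name)]
# seq_ids    <- getNCBISeqID(goodNames)
# seq_data   <- getGenBankSeqs(seq_ids[1])
```

Removing duplicates before the taxonomy lookup keeps the number of NCBI requests down when many occurrence records share a species.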
