abbagadabba

The R package abbagadabba (amplicon-based biodiversity assessment, gap analysis, database building and more) provides functions to download DNA sequence data from NCBI based on species names.

Installation

abbagadabba is in active development and is currently available only on GitHub. It can be installed with devtools, a stable package available on CRAN.

library(devtools)
install_github("Maine-eDNA/abbagadabba")

Example usage

Suppose we have a list of species for which we want to retrieve sequence data. This list might include “bad names” (e.g. synonyms or misspellings). We first want to correct as many of those bad names as we can. At the same time, we’ll compile the taxonomic hierarchy for each of those species.

library(abbagadabba)

cleanNames <- getNCBITaxonomy(c('Idiomyia sproati', 'Drosophil murphy', 'no body'))
cleanNames

#   kingdom     phylum   class   order   suborder  infraorder superfamily
# 1 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 2 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 3    <NA>       <NA>    <NA>    <NA>       <NA>        <NA>        <NA>
#          family     subfamily        tribe      genus         old_name
# 1 Drosophilidae Drosophilinae Drosophilini Drosophila Drosophil murphy
# 2 Drosophilidae Drosophilinae Drosophilini Drosophila Idiomyia sproati
# 3          <NA>          <NA>         <NA>       <NA>          no body
#            ncbi_name   uid
# 1 Drosophila murphyi 48335
# 2 Drosophila sproati  7289
# 3               <NA>  <NA>

We were able to fix Drosophil murphy, making it Drosophila murphyi, and Idiomyia sproati, making it Drosophila sproati, but we couldn’t match no body.
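
To see at a glance which names were corrected and which could not be matched, the old_name and ncbi_name columns can be compared directly. The following is a minimal sketch that runs on a hand-built stand-in for cleanNames, keeping only three of the columns printed above. The helper summarizeNameFixes is hypothetical and is not part of abbagadabba.

```r
# Stand-in for the getNCBITaxonomy() result above, keeping only the
# columns used here (values copied from the printed output).
cleanNames <- data.frame(
  old_name  = c("Drosophil murphy", "Idiomyia sproati", "no body"),
  ncbi_name = c("Drosophila murphyi", "Drosophila sproati", NA),
  uid       = c("48335", "7289", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of abbagadabba): label each input name as
# unmatched, unchanged, or corrected.
summarizeNameFixes <- function(x) {
  status <- ifelse(is.na(x$ncbi_name), "unmatched",
                   ifelse(x$old_name == x$ncbi_name, "unchanged", "corrected"))
  data.frame(old_name = x$old_name, ncbi_name = x$ncbi_name,
             status = status, stringsAsFactors = FALSE)
}

nameReport <- summarizeNameFixes(cleanNames)
nameReport
```

Rows labeled "unmatched" are the ones to check by hand before moving on.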

Now we need to get all sequence identifiers associated with those species:

goodNames <- cleanNames$ncbi_name[!is.na(cleanNames$ncbi_name)]
seqIDs <-  getNCBISeqID(goodNames)
head(seqIDs)

# [1] "2053656522" "2053655721" "1679378317" "1679378287" "1679378281"
# [6] "1679378246"

Finally, we can feed those IDs into the function that retrieves the sequences themselves. For the purpose of this example, we’ll just look at the first two sequences.

seqData <- getGenBankSeqs(seqIDs[1:2])
seqData

# $data
#           accession            species        date pubmed
# 1 JAEIFY000000000.1 Drosophila sproati 16-JUN-2021     NA
# 2 JAEIFX000000000.1 Drosophila murphyi 16-JUN-2021     NA
#                      pubDOI region product organelle region_note
# 1 10.1101/2020.12.14.422775     NA      NA        NA          NA
# 2 10.1101/2020.12.14.422775     NA      NA        NA          NA
#                     latlon                            locality coll_date
# 1 19.574513 N 155.216191 W USA: Waiakea Forest Reserve, Hawaii  Jun-2019
# 2 19.911621 N 155.313161 W                         USA: Hawaii  Apr-2018
#     coll_by specimen_id
# 1 Don Price          NA
# 2 Don Price          NA
# 
# $dna
# [1] ">JAEIFY000000000.1\n" ">JAEIFX000000000.1\n"
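
For longer ID vectors, it may be preferable to request sequences in small batches rather than all at once. The sketch below shows only the batching step, in base R. The helper chunkIDs and the batch size are illustrative assumptions, not part of abbagadabba. The commented-out lines assume that getGenBankSeqs returns a list with data and dna elements, as in the output above, and that the data elements can be row-bound.

```r
# First six IDs from the getNCBISeqID() output shown earlier.
seqIDs <- c("2053656522", "2053655721", "1679378317",
            "1679378287", "1679378281", "1679378246")

# Hypothetical helper (not part of abbagadabba): split a vector of IDs
# into consecutive batches of at most `size` elements.
chunkIDs <- function(ids, size = 4) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

batches <- chunkIDs(seqIDs, size = 4)
lengths(batches)

# Each batch could then be passed to getGenBankSeqs(), for example:
# results <- lapply(batches, getGenBankSeqs)
# allData <- do.call(rbind, lapply(results, function(r) r$data))
```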

Example usage using locationToSequence.R

To find sequence data for species based on a state or kingdom, run locationToSequence.R. The script requires the rgbif package, which can be installed from an R session started in the terminal:

R
install.packages('rgbif')

When prompted, select a CRAN mirror and the package should install.

To modify the search parameters, edit the call to the occ_data function from the rgbif package on line 5 of the script, which is where the query is specified:

full_data <- rgbif::occ_data(country = 'US', limit = 1, stateProvince = 'Maine', scientificName = '')
# All parameters can be modified; leave a parameter empty or delete it if it should not be part of the query.

The number of sequences returned for each name depends on how many sequence IDs are requested. This can be changed on line 26 of the script:

seq_data <- getGenBankSeqs(seq_ids[1])
# Change to seq_data <- getGenBankSeqs(seq_ids[1:5]) to get back multiple sequences
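
The step between the occurrence query and the sequence lookup is turning occurrence records into a list of species names. The script itself is not reproduced here, so the following is only a sketch of how that step might look. It assumes the data element of the occ_data result has a species column, and it uses a made-up stand-in data frame instead of a live GBIF query.

```r
# Made-up stand-in for the `data` element of an rgbif::occ_data() result;
# real results contain many more columns and come from a GBIF query.
occ <- data.frame(
  species       = c("Drosophila sproati", "Drosophila murphyi",
                    "Drosophila sproati", NA),
  stateProvince = c("Hawaii", "Hawaii", "Hawaii", "Hawaii"),
  stringsAsFactors = FALSE
)

# Keep each non-missing species name once, in order of first appearance.
speciesNames <- unique(occ$species[!is.na(occ$species)])
speciesNames

# These names could then be passed through getNCBITaxonomy(),
# getNCBISeqID() and getGenBankSeqs() as in the earlier example.
```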
