The R package abbagadabba (amplicon-based biodiversity assessment, gap analysis, database building and more) provides functions to download DNA sequence data from NCBI based on species names.
abbagadabba is in active development and is currently available only on GitHub. It can be installed with devtools, a stable package available on CRAN:
library(devtools)
install_github("Maine-eDNA/abbagadabba")
Suppose we have a list of species for which we want to retrieve sequence data. This list might include “bad names” (e.g. synonyms or misspellings). We first want to correct as many of those bad names as we can. At the same time, we’ll compile the taxonomic hierarchy of those species:
library(abbagadabba)
cleanNames <- getNCBITaxonomy(c('Idiomyia sproati', 'Drosophil murphy', 'no body'))
cleanNames
# kingdom phylum class order suborder infraorder superfamily
# 1 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 2 Metazoa Arthropoda Insecta Diptera Brachycera Muscomorpha Ephydroidea
# 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# family subfamily tribe genus old_name
# 1 Drosophilidae Drosophilinae Drosophilini Drosophila Drosophil murphy
# 2 Drosophilidae Drosophilinae Drosophilini Drosophila Idiomyia sproati
# 3 <NA> <NA> <NA> <NA> no body
# ncbi_name uid
# 1 Drosophila murphyi 48335
# 2 Drosophila sproati 7289
# 3 <NA> <NA>
We were able to fix Drosophil murphy, making it Drosophila murphyi, and Idiomyia sproati, making it Drosophila sproati, but we couldn’t match no body.
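To see at a glance which inputs were corrected and which were dropped, the old_name and ncbi_name columns can be compared directly. The sketch below is only an illustration: cleanNamesDemo is a hand-built stand-in that copies three columns from the printed output above, so it runs without contacting NCBI. With real results, the same indexing should apply to cleanNames, assuming the column names shown above.

```r
# Hand-built stand-in for three columns of the cleanNames output printed above
# (illustrative only; the real object comes from getNCBITaxonomy)
cleanNamesDemo <- data.frame(
  old_name  = c("Drosophil murphy", "Idiomyia sproati", "no body"),
  ncbi_name = c("Drosophila murphyi", "Drosophila sproati", NA),
  uid       = c("48335", "7289", NA),
  stringsAsFactors = FALSE
)

# Inputs that could not be matched to an NCBI name
unmatched <- cleanNamesDemo$old_name[is.na(cleanNamesDemo$ncbi_name)]

# Inputs whose name changed during matching
matched <- cleanNamesDemo[!is.na(cleanNamesDemo$ncbi_name), ]
corrected <- matched[matched$old_name != matched$ncbi_name,
                     c("old_name", "ncbi_name")]

unmatched
corrected
```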
Now we need to get all sequence identifiers associated with those species:
goodNames <- cleanNames$ncbi_name[!is.na(cleanNames$ncbi_name)]
seqIDs <- getNCBISeqID(goodNames)
head(seqIDs)
# [1] "2053656522" "2053655721" "1679378317" "1679378287" "1679378281"
# [6] "1679378246"
Finally, we can pass those IDs to getGenBankSeqs to retrieve the sequences themselves. For the purpose of this example, we’ll look at just the first two:
seqData <- getGenBankSeqs(seqIDs[1:2])
seqData
# $data
# accession species date pubmed
# 1 JAEIFY000000000.1 Drosophila sproati 16-JUN-2021 NA
# 2 JAEIFX000000000.1 Drosophila murphyi 16-JUN-2021 NA
# pubDOI region product organelle region_note
# 1 10.1101/2020.12.14.422775 NA NA NA NA
# 2 10.1101/2020.12.14.422775 NA NA NA NA
# latlon locality coll_date
# 1 19.574513 N 155.216191 W USA: Waiakea Forest Reserve, Hawaii Jun-2019
# 2 19.911621 N 155.313161 W USA: Hawaii Apr-2018
# coll_by specimen_id
# 1 Don Price NA
# 2 Don Price NA
#
# $dna
# [1] ">JAEIFY000000000.1\n" ">JAEIFX000000000.1\n"
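The latlon column is returned as text, so it may need converting before mapping or spatial filtering. The following is a minimal sketch, assuming every value follows the "decimal degrees, hemisphere letter" pattern seen in the output above; the two strings are copied from that output, and latlonDemo and the helper logic are illustrative rather than part of abbagadabba.

```r
# Two latlon strings copied from the example output above
latlonDemo <- c("19.574513 N 155.216191 W", "19.911621 N 155.313161 W")

# Split each string into: latitude, N/S, longitude, E/W
parts <- strsplit(latlonDemo, " ", fixed = TRUE)

# Southern latitudes and western longitudes become negative
lat <- vapply(parts, function(p) {
  sgn <- if (p[2] == "S") -1 else 1
  sgn * as.numeric(p[1])
}, numeric(1))
lon <- vapply(parts, function(p) {
  sgn <- if (p[4] == "W") -1 else 1
  sgn * as.numeric(p[3])
}, numeric(1))

coords <- data.frame(lat = lat, lon = lon)
coords
```

Records with a missing or differently formatted latlon value would need extra handling before this conversion.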
To find sequence data for species based on a location (such as a state) or a kingdom, run locationToSequence.R. The script requires the rgbif package, which can be installed from an R session started in the terminal:
R
install.packages('rgbif')
If R prompts for one, select a CRAN mirror and the package should install.
To modify the search parameters, edit the call to the occ_data function from the rgbif package on line 5 of the script, where the query is specified:
full_data <- rgbif::occ_data(country = 'US', limit = 1, stateProvince = 'Maine', scientificName = '')
# Any parameter can be changed; leave it empty or delete it if you do not want to filter on it.
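One way to avoid editing the call by hand each time is to keep the query in a list and drop any empty entries before calling occ_data. This is a sketch of that idea, not how locationToSequence.R is written; the params list is hypothetical, and the final call is commented out because it needs rgbif and network access.

```r
# Hypothetical query settings; an empty string means "do not filter on this"
params <- list(
  country        = "US",
  stateProvince  = "Maine",
  scientificName = "",
  limit          = 1
)

# Drop parameters that were left as empty strings
isEmpty <- vapply(params, function(x) is.character(x) && !nzchar(x), logical(1))
params <- params[!isEmpty]

names(params)

# Requires the rgbif package and an internet connection:
# full_data <- do.call(rgbif::occ_data, params)
```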
The script then returns sequence data for each name. The number of sequences retrieved per name is set on line 26 and can be changed there:
seq_data <- getGenBankSeqs(seq_ids[1])
# Change to seq_data <- getGenBankSeqs(seq_ids[1:5]) to retrieve multiple sequences