arctosr: An interface to the Arctos Database for R
Biological specimens and their associated data, housed together in natural history museum biorepositories, represent the primary biodiversity infrastructure of the planet. Specimen data is served to the biological community through a series of interoperable, online databases. Specimen metadata provides ecological and spatial context for designing experimental sampling and conducting research using computational tools. Therefore, interfaces that simplify access to these databases are highly relevant. For instance, tools like rgbif and spocc, integrated into the R programming environment, are widely used because they make it easier to retrieve data from data aggregators like the Global Biodiversity Information Facility (GBIF). However, data aggregators like GBIF do not import all data fields available in Arctos, leaving large volumes of valuable research-grade data inaccessible through existing R packages. For example, the Arctos database connects hosts to pathogens, including metadata about the molecular methods used to detect pathogens, but that information is not accessible at scale through R. In addition, the primary user interface for the Arctos database is designed for general public users and curators, so running specific queries and obtaining structured data for research applications can be complicated and unintuitive. Creating an R package that interfaces with the Arctos API would address this gap, ensuring researchers have access to a wider range of relevant information to advance scientific inquiry and guide strategic wildlife sampling moving forward.
No other R package facilitates direct access to the Arctos database and its metadata connections. Other packages exist to interface with data aggregators like GBIF, which scrape a subset of primary data from Arctos and other primary museum databases, but those aggregators do not integrate all fields included in Arctos. Critical information about pathogen test results from specimens, internal and external parasites, tissue type (heart, kidney, etc.), tissue quality (excellent to poor), pathogen screening methods, preservation (-196C, -80C, -40C, -20C, 95% EtOH, RNAlater, etc.), and trait measurements, for instance, is only available through Arctos. This project aims to build an interface between this database and R to make these data more accessible and ready for research applications.
We expect the contributor to develop a new R package to access, download, and explore data from the Arctos database using its API. The main functions of the package will help to:
- Access the Arctos database and download data according to specific queries.
- Generate citable references for the data downloaded.
- Explore and summarize queried search results.
- Subset and organize data into structures such as data frames and tables to facilitate research applications.
- Save raw and processed data to local directories.
We expect this new R package to be fully documented at the end of the period. The development of one or two vignettes to demonstrate the use of the package's main functionalities is also expected. Implementation of tests will be optional for this version of the package.
A broad community of researchers and students from distinct and varied fields of science will benefit from this project. The tools to be developed will allow access to information of great value for research in biodiversity science, evolutionary biology, anthropology, ecology, epidemiology, and other fields. By facilitating automated access to diverse biological databases, such initiatives not only improve research capabilities but also promote collaboration and knowledge sharing within the scientific community, thereby accelerating progress in various fields of study.
EVALUATING MENTOR: Marlon E. Cobos (manubio13@gmail.com) is an ecological modeler and biogeographer who has been a GSoC student and mentor since 2018 with the R Project Organization. Marlon is the author and maintainer of R packages like mop and kuenm, and has contributed to several packages on CRAN.
Vijay Barve (vijay.barve@gmail.com) is a biodiversity data scientist who has been a GSoC student and mentor since 2012 with the R Project Organization. Vijay is the author and maintainer of bdvis and has contributed to several packages on CRAN.
Jocelyn P. Colella (colella@ku.edu) is an evolutionary biologist and experienced Arctos user. Jocelyn uses Arctos as a source of information for specimen and tissue availability in collections, as well as a tool for experimental design, specifically for biogeographic sampling of specimens.
Michelle Koo (mkoo@berkeley.edu) is the Director of the Arctos Consortium and Staff Curator of Biodiversity Informatics and GIS at the University of California Berkeley’s Museum of Vertebrate Zoology.
Please do one or more of the following tests before contacting the mentors above. The more tests you complete, the better. Please post your solutions under the next section.
Easy:
- Install the package spocc and download occurrence data for one mammal species of your choice.
- Filter the data to keep only records with geographic coordinates.
- Filter the data to keep records with coordinate uncertainty less than or equal to 1000 meters.
- Filter the data to keep only the columns: species, decimalLongitude, decimalLatitude, day, month, and year.
- Prepare a summary of the number of records after each of the steps described above.
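As a rough illustration of the filtering and counting steps above, here is a minimal base R sketch. The helper name `filter_occurrences()` is hypothetical, and the column names (including `coordinateUncertaintyInMeters`) are assumptions: the table returned by spocc may name some columns differently, so they are exposed as arguments. The toy data frame is made up purely to show the function running; it stands in for a real download.

```r
# Hypothetical helper: applies the three filters in order and counts the
# records left after each step. Column names are assumptions; check
# names(occ) on the data you actually downloaded.
filter_occurrences <- function(occ,
                               lon = "decimalLongitude",
                               lat = "decimalLatitude",
                               unc = "coordinateUncertaintyInMeters",
                               keep = c("species", lon, lat,
                                        "day", "month", "year"),
                               max_uncertainty = 1000) {
  counts <- c(downloaded = nrow(occ))

  # Step 1: keep records that have both coordinates
  occ <- occ[!is.na(occ[[lon]]) & !is.na(occ[[lat]]), , drop = FALSE]
  counts["with_coordinates"] <- nrow(occ)

  # Step 2: keep records with a known uncertainty <= max_uncertainty meters
  # (records with missing uncertainty are dropped; that is a design choice)
  ok <- !is.na(occ[[unc]]) & occ[[unc]] <= max_uncertainty
  occ <- occ[ok, , drop = FALSE]
  counts["uncertainty_ok"] <- nrow(occ)

  # Step 3: keep only the requested columns (row count is unchanged)
  occ <- occ[, keep, drop = FALSE]
  counts["selected_columns"] <- nrow(occ)

  list(data = occ, summary = counts)
}

# Made-up records for illustration only, not real occurrence data
toy <- data.frame(
  species = rep("Puma concolor", 4),
  decimalLongitude = c(-105.1, NA, -99.2, -101.5),
  decimalLatitude = c(40.2, 35.0, 19.4, NA),
  coordinateUncertaintyInMeters = c(500, 100, 5000, 200),
  day = c(1, 2, 3, 4),
  month = c(5, 6, 7, 8),
  year = c(2001, 2002, 2003, 2004),
  basisOfRecord = "PRESERVED_SPECIMEN"
)

res <- filter_occurrences(toy)
res$summary
```

Returning the counts alongside the filtered data keeps the summary in step with the filters, so the two cannot drift apart if a threshold is changed later.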
Medium:
- Create a function that downloads specimen records from the Arctos database (https://arctos.database.museum/) via their API (https://handbook.arctosdb.org/documentation/api.html) with queries using species scientific/Latin names.
- Demonstrate the use of the function with a mammal species of your choice.
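One possible shape for such a function is sketched below. Everything specific to the API here is an assumption to verify against the Arctos API handbook linked above: the endpoint path, the `method` value, the `api_key` and `scientific_name` parameter names, and the need for a key. The function names are hypothetical. Building the URL is kept separate from fetching so the query can be inspected without network access.

```r
# Assumed endpoint and parameter names; verify them in the Arctos API
# handbook before relying on this sketch.
arctos_query_url <- function(scientific_name, api_key,
                             endpoint = paste0(
                               "https://arctos.database.museum/",
                               "component/api/v2/catalog.cfc"
                             ),
                             method = "getCatalogData") {
  stopifnot(is.character(scientific_name), length(scientific_name) == 1,
            nzchar(scientific_name))
  params <- c(method = method, api_key = api_key,
              scientific_name = scientific_name)
  # Percent-encode each value (the space in a binomial becomes %20)
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(endpoint, "?",
         paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Hypothetical wrapper: fetches the URL and parses the JSON response.
# The structure of the parsed result depends on what the server returns.
get_arctos_records <- function(scientific_name, api_key, ...) {
  if (!requireNamespace("jsonlite", quietly = TRUE)) {
    stop("Package 'jsonlite' is needed to parse the response.")
  }
  url <- arctos_query_url(scientific_name, api_key, ...)
  jsonlite::fromJSON(url)
}

# Inspect the query without contacting the server ("YOUR_KEY" is a placeholder)
query <- arctos_query_url("Peromyscus maniculatus", api_key = "YOUR_KEY")
query
```

With a valid key, `get_arctos_records("Peromyscus maniculatus", api_key = "...")` would perform the actual request; error handling for failed requests and paging through large result sets are left out of this sketch.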
Hard:
- Create an R package that contains the function created above (and helper functions if needed).
- Create an example of how to install the package and use the function.
- Document the function and the package in general.
- Check the package using GitHub Actions (no errors, warnings, or notes).
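For the documentation step, a roxygen2 comment block placed above each function is one common approach. The sketch below documents a small hypothetical helper, `clean_scientific_name()`, which is not part of any existing package. The commented commands at the end are one typical devtools/usethis workflow, and `<user>/<repo>` is a placeholder for the contributor's own repository.

```r
#' Clean a scientific name before querying
#'
#' Hypothetical helper shown only to illustrate roxygen2 documentation.
#'
#' @param x A single, non-missing character string, e.g. a Latin binomial.
#' @return The name with surrounding whitespace removed and internal runs
#'   of whitespace collapsed to single spaces.
#' @examples
#' clean_scientific_name("  Peromyscus   maniculatus ")
#' @export
clean_scientific_name <- function(x) {
  if (!is.character(x) || length(x) != 1 || is.na(x)) {
    stop("`x` must be a single, non-missing character string.")
  }
  # Trim the ends, then collapse repeated whitespace inside the name
  x <- gsub("\\s+", " ", trimws(x))
  if (!nzchar(x)) {
    stop("`x` must not be empty.")
  }
  x
}

cleaned <- clean_scientific_name("  Peromyscus   maniculatus ")

# Typical workflow, run from the package root (not executed here):
# devtools::document()                          # build man/ pages from roxygen
# devtools::check()                             # run R CMD check locally
# usethis::use_github_action("check-standard")  # add an R CMD check workflow
# remotes::install_github("<user>/<repo>")      # install from GitHub
```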
Please post a link to your test results here.
EXAMPLE: CONTRIBUTOR NAME, LINK TO GITHUB PROFILE, LINK TO GITHUB REPOSITORY WITH TEST RESULTS.
Contributor Name | GitHub Profile | Test Results |
---|---|---|
Harsh Jain | Github Profile | Test results |
Harlan Williams | https://github.com/hrhwilliams | https://github.com/hrhwilliams/gsoc2024-arctosr-tests/tree/main |