John Sundh edited this page Sep 28, 2022 · 1 revision

Quick start

To generate a file bold_clustered.fasta with COI-5P sequences, run:

coidb

This will download, filter and cluster sequences from GBIF Hosted Datasets.

See below for configuration and more options.

Output

Sequences in the resulting bold_clustered.fasta file contain the original identifier in their primary id, followed by a string showing their taxonomic lineage in the FASTA header:

>centroid=GBA28357-15 Arthropoda;Insecta;Psocodea;Philotarsidae;Aaroniella;Aaroniella sp.;seqs=1

In this example centroid= indicates that sequences from this species were clustered with vsearch and that the representative sequence for the resulting cluster is GBA28357-15.
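
The header layout described above can be taken apart with a few lines of Python. This is a minimal sketch, not part of coidb: the function name parse_header and the returned dictionary keys are made up for illustration, and it assumes that lineage ranks never contain an "=" character.

```python
def parse_header(header: str) -> dict:
    """Split a coidb-style FASTA header into id, lineage and key=value fields."""
    header = header.lstrip(">").strip()
    # The primary id is everything up to the first space.
    seq_id, _, description = header.partition(" ")
    lineage = []
    extras = {}
    for field in description.split(";"):
        field = field.strip()
        if not field:
            continue
        # Annotations such as "seqs=1" are kept apart from the taxonomic ranks
        # (assumption: rank names never contain "=").
        if "=" in field:
            key, _, value = field.partition("=")
            extras[key] = value
        else:
            lineage.append(field)
    # Strip the "centroid=" prefix to recover the original identifier.
    if seq_id.startswith("centroid="):
        original_id = seq_id.split("=", 1)[1]
    else:
        original_id = seq_id
    return {"id": seq_id, "centroid": original_id,
            "lineage": lineage, "extras": extras}


record = parse_header(
    ">centroid=GBA28357-15 Arthropoda;Insecta;Psocodea;Philotarsidae;"
    "Aaroniella;Aaroniella sp.;seqs=1"
)
print(record["centroid"])      # GBA28357-15
print(record["lineage"][-1])   # Aaroniella sp.
```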

Configuration

There are a few configurable parameters that modify how sequences are filtered and clustered. You can change these parameters using a config file in YAML format. The default setup looks like this:

database:
    # url to download info and sequence files from
    url: "https://hosted-datasets.gbif.org/ibol/ibol.zip"
    # gene of interest (will be used to filter sequences)
    gene:
        - "COI-5P"
    # phyla of interest (omit this in order to include all phyla)
    phyla: []
    # Percent identity to cluster seqs in the database by
    pid: 1.0

Gene types

By default, only sequences named 'COI-5P' are included in the final output. To modify this behaviour you can supply a config file in YAML format via -c <path-to-configfile.yaml>. For example, to also include 'COI-3P' sequences, create a config file (e.g. config.yaml) with these contents:

database:
  gene:
    - 'COI-5P'
    - 'COI-3P' 

Then run coidb as:

coidb -c config.yaml
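
The config file above sets only the gene list, which suggests that the remaining settings keep their defaults. The sketch below illustrates that idea with a recursive merge. It is an assumption about the behaviour, not coidb's actual code, and the names DEFAULTS and merge_config are hypothetical.

```python
import copy

# Default settings as listed in the Configuration section.
DEFAULTS = {
    "database": {
        "url": "https://hosted-datasets.gbif.org/ibol/ibol.zip",
        "gene": ["COI-5P"],
        "phyla": [],
        "pid": 1.0,
    }
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults updated with overrides, descending into nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists replace the default value wholesale.
            merged[key] = copy.deepcopy(value)
    return merged


# Equivalent of the config.yaml shown above.
user_config = {"database": {"gene": ["COI-5P", "COI-3P"]}}
effective = merge_config(DEFAULTS, user_config)
print(effective["database"]["gene"])  # ['COI-5P', 'COI-3P']
print(effective["database"]["pid"])   # 1.0
```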

Typical gene names and their occurrence in the database are listed under Common gene types below.

Phyla

The default is to include sequences from all taxa. However, you can filter the resulting sequences to only those from one or more phyla. For instance, to only include sequences from the phyla 'Arthropoda' and 'Chordata' you supply a config file with these contents:

database:
  phyla:
    - 'Arthropoda'
    - 'Chordata' 

Typical phyla and their occurrence in the database are listed under Common phyla below.

Clustering

After sequences have been filtered to the genes and phyla of interest, they are clustered on a per-species basis (or per BOLD BIN id where applicable) using vsearch. By default, this clustering is performed at 100% identity. To change this, e.g. to 95% identity, make sure your config file contains:

database:
  pid: 0.95
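
Note that pid is given as a fraction (0.95), not a percentage (95). The sketch below shows how such a value could be checked and passed on to a vsearch clustering call. It is not necessarily the command coidb runs: the function name and file names are hypothetical, and only the general-purpose vsearch options --cluster_fast, --id, --centroids and --threads are used.

```python
def vsearch_command(fasta_in, centroids_out, pid=1.0, threads=4):
    """Build a vsearch clustering command for one group of sequences."""
    # vsearch expects the identity threshold as a fraction between 0 and 1.
    if not 0.0 < pid <= 1.0:
        raise ValueError(f"pid must be a fraction in (0, 1], got {pid}")
    return [
        "vsearch",
        "--cluster_fast", str(fasta_in),
        "--id", str(pid),
        "--centroids", str(centroids_out),
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = vsearch_command("species.fasta", "species_centroids.fasta", pid=0.95)
print(" ".join(cmd))
```

The returned list can be handed to subprocess.run if vsearch is installed.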

Command line options

The coidb tool is a wrapper for a small snakemake workflow that handles all the downloading, filtering and clustering.

usage: coidb [-h] [-n] [-j CORES] [-f] [-u] [-c [CONFIG_FILE ...]] [--cluster-config CLUSTER_CONFIG] [--workdir WORKDIR] [-p] [targets ...]

positional arguments:
  targets               File(s) to create or steps to run. If omitted, the full pipeline is run.

optional arguments:
  -h, --help            show this help message and exit
  -n, --dryrun          Only print what to do, don't do anything [False]
  -j CORES, --cores CORES
                        Number of cores to run with [4]
  -f, --force           Force workflow run
  -u, --unlock          Unlock working directory
  -c [CONFIG_FILE ...], --config-file [CONFIG_FILE ...]
                        Path to configuration file
  --cluster-config CLUSTER_CONFIG
                        Path to cluster config (for running on SLURM)
  --workdir WORKDIR     Working directory. Defaults to current dir
  -p, --printshellcmds  Print shell commands

Explanation:

-n, --dryrun: Only print what will be done, don't actually do anything.

-j, --cores: The number of cores to run the workflow with. Because the download and filtering steps have to run sequentially, this only affects the clustering step, which uses vsearch.

-f, --force: Force the execution of the workflow even though files already exist.

-u, --unlock: Release a working directory lock (which could result from a previously interrupted run).

-c, --config-file: Supply one or more configuration files to alter the behaviour of the tool.

--workdir: Specify the directory in which to read/write output files. Defaults to the current directory.

-p, --printshellcmds: Shows the actual commands as they are being executed.
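
When coidb is called from a script, the options above can be assembled into an argument list. The sketch below uses only flags from the usage text; the helper name coidb_command is made up. Because the usage line shows -c accepting several values, targets are placed before the options here as a precaution, so that a target is not read as another config file. That ordering is an assumption about the parser, not documented behaviour.

```python
import shlex


def coidb_command(targets=(), cores=4, config_files=(), workdir=None,
                  dryrun=False, force=False, printshellcmds=False):
    """Assemble a coidb argument list from the options in the usage text."""
    # Targets go first; -c goes last because it can take several values.
    cmd = ["coidb", *targets, "-j", str(cores)]
    if dryrun:
        cmd.append("-n")
    if force:
        cmd.append("-f")
    if printshellcmds:
        cmd.append("-p")
    if workdir is not None:
        cmd += ["--workdir", str(workdir)]
    if config_files:
        cmd += ["-c", *map(str, config_files)]
    return cmd


cmd = coidb_command(targets=["filter"], cores=8,
                    config_files=["config.yaml"], dryrun=True)
print(shlex.join(cmd))  # coidb filter -j 8 -n -c config.yaml
```

The list can be passed to subprocess.run(cmd, check=True) to execute the dry run.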

Step-by-step

You can also run the coidb tool in steps, e.g. if you are only interested in some of the files or if you want to inspect the results before proceeding to the next step. This is done using the positional argument targets.

Valid targets are download, filter and cluster.

Step 1: Download

For example, to only download files from GBIF you can run:

coidb download

This should produce two files, bold_info.tsv and bold_seqs.txt, containing metadata and nucleotide sequences, respectively.

Step 2: Filter

To also filter the bold_info.tsv and bold_seqs.txt files (according to the default 'COI-5P' gene or any other genes/phyla you've defined in the optional config file) you can run:

coidb filter

This filters the sequences in bold_seqs.txt to the genes of interest and the entries in bold_info.tsv to the phyla of interest. Entries are then merged so that only sequences with relevant information are kept. The output files from this step are bold_filtered.fasta and bold_info_filtered.tsv.
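
The filter-then-merge logic can be sketched on in-memory records as follows. This illustrates the idea only and is not coidb's implementation: the field names record_id, phylum, gene and seq are hypothetical, and the toy records are made up. An empty phyla list is taken to mean "all phyla", matching the default configuration.

```python
def filter_records(info_rows, sequences, genes, phyla):
    """Keep sequences of the wanted genes whose metadata passes the phyla filter."""
    genes = set(genes)
    phyla = set(phyla)
    # Metadata entries are filtered by phylum (empty set means keep everything).
    kept_info = {
        row["record_id"]: row
        for row in info_rows
        if not phyla or row["phylum"] in phyla
    }
    # Sequences are filtered by gene name.
    kept_seqs = [s for s in sequences if s["gene"] in genes]
    # Merge: only sequences that still have a metadata entry survive.
    return [
        {**kept_info[s["record_id"]], **s}
        for s in kept_seqs
        if s["record_id"] in kept_info
    ]


# Made-up toy data for illustration.
info = [
    {"record_id": "r1", "phylum": "Arthropoda"},
    {"record_id": "r2", "phylum": "Chordata"},
    {"record_id": "r3", "phylum": "Mollusca"},
]
seqs = [
    {"record_id": "r1", "gene": "COI-5P", "seq": "ACGT"},
    {"record_id": "r2", "gene": "ITS", "seq": "GGCC"},
    {"record_id": "r3", "gene": "COI-5P", "seq": "TTAA"},
]
kept = filter_records(info, seqs, genes=["COI-5P"],
                      phyla=["Arthropoda", "Chordata"])
print([r["record_id"] for r in kept])  # ['r1']
```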

Step 3: Clustering

The final step clusters sequences in bold_filtered.fasta on a per-species basis. This means that for each species, the sequences are gathered and clustered with vsearch, and only the representative sequences are kept. In this step, a sequence can be assigned either a species name or a BOLD BIN ID (e.g. BOLD:AAY5017); the two are treated as equivalent.
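
The grouping that precedes clustering can be pictured as below: records are bucketed by their species label, and a BOLD BIN ID is used as a label in the same way as a species name. This is a sketch with made-up record ids (rec1 to rec3) and a hypothetical function name, not coidb's own code.

```python
from collections import defaultdict


def group_by_taxon(records):
    """Group record ids by species name or BOLD BIN ID."""
    groups = defaultdict(list)
    for record_id, taxon in records:
        # A BIN ID such as "BOLD:AAY5017" is treated like any species name.
        groups[taxon].append(record_id)
    return dict(groups)


# Made-up record ids; each group would be clustered separately with vsearch.
records = [
    ("rec1", "Aaroniella sp."),
    ("rec2", "BOLD:AAY5017"),
    ("rec3", "BOLD:AAY5017"),
]
groups = group_by_taxon(records)
for taxon, ids in groups.items():
    print(taxon, len(ids))
```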

To run the clustering step, do:

coidb cluster

The end result is a file bold_clustered.fasta.
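
When running step by step, it can help to check that each step left its files behind before moving on. The sketch below maps each target to the output files named in this section and reports which are missing. It assumes the files are written directly into the working directory; the names EXPECTED_OUTPUTS and missing_outputs are hypothetical.

```python
from pathlib import Path

# Output files per target, as named in the Step-by-step section.
EXPECTED_OUTPUTS = {
    "download": ["bold_info.tsv", "bold_seqs.txt"],
    "filter": ["bold_filtered.fasta", "bold_info_filtered.tsv"],
    "cluster": ["bold_clustered.fasta"],
}


def missing_outputs(target, workdir="."):
    """Return the expected files for a target that are absent from workdir."""
    workdir = Path(workdir)
    return [
        name
        for name in EXPECTED_OUTPUTS[target]
        if not (workdir / name).is_file()
    ]


for target in ("download", "filter", "cluster"):
    missing = missing_outputs(target)
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{target}: {status}")
```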

Common gene types

#seqs gene
6074566 COI-5P
153409 COI-3P
146758 ITS
114124 matK
110798 ITS2
86915 rbcL
66793 rbcLa
14192 16S
13496 CYTB
10675 trnH-psbA
9787 COII
9140 28S
9066 COXIII
6166 ND2
5872 ND1
5868 ND5-0
5868 ND3
5867 ND4
5863 ND4L
5843 ND6
5772 ITS1
4866 28S-D2
3940 12S
3751 18S
3547 atp6
3459 5-8S
3135 trnL-F
3027 D-loop
2870 EF1-alpha
1991 Wnt1
1822 Rho
1722 COI-PSEUDO
1716 H3
1326 CAD
1241 rpoC1
1236 atpF-atpH
968 tufA
944 COI-LIKE
865 rpoB
865 UPA
749 psbK-psbI
597 28S-D2-D3
470 CAD4
449 PSBA
431 PGD
393 DBY-EX7-8
383 GAPDH
353 RpS5
336 ycf1
309 AATS
296 28S-D1-D2
240 28S-D9-D10
223 MDH
194 TPI
192 trnD-trnY-trnE
188 LWRHO
186 RAG1
172 H4
171 COII-COI
168 ND6-ND3
167 RAG2
167 16S-ND2
166 IDH
154 RpS2
144 18S-V4
141 28S-D3-D5
137 RNF213
132 MC1R
132 MB2-EX2-3
125 fbpA
124 ND4L-MSH
124 ArgKin
120 CADH
117 CHD-Z
107 ENO
103 28S-D3
101 CHOLC
99 VDAC
98 ADR
95 RPB2
94 atpB-rbcL
94 atp6-atp8
92 DYN
91 H3-NUMT
88 COI-NUMT
86 PSA
86 CYTB-NUMT
81 AOX-fmt
72 trnK
69 matR
65 CsIV
64 nucLSU
64 EF2
61 TYR
61 ARK
56 ATP1A
55 petD-intron
55 matK-trnK
53 PLAGL2
47 psbA-3P
38 PER
31 matK-like
31 FL-COI
30 CAD1
30 18S-3P
25 rbcL-like
24 DDC
21 HfIV
20 R35
17 COII-COIII
16 RBM15
16 NGFB
16 CK1
15 WSP
14 psaB
14 TULP
10 rpL32-trnL
10 PY-IGS
9 EF1-alpha-5P
7 NBC-COI-5P
4 COI-5PNMT1
2 TMO-4C4
2 PKD1
1 S7
1 RPL37
1 RPB1
1 RBCL-5P
1 COI-5PNMT2
1 Beta-tubulin

Common phyla

#entries Phylum
5886491 Arthropoda
505704 Chordata
270743 Magnoliophyta
180809 Mollusca
76301 Ascomycota
58536 Annelida
48727 Basidiomycota
29817 Rhodophyta
28723 Echinodermata
28105 Platyhelminthes
21786 Nematoda
19453 Cnidaria
16321 Bryophyta
9116 Rotifera
8368 Pteridophyta
7122 Chlorophyta
5863 Pinophyta
4877 Porifera
4863 Heterokontophyta
3770 Nemertea
3516 Glomeromycota
2934 Zygomycota
2095 Acanthocephala
1787 Bryozoa
1671 Tardigrada
1512 Pyrrophycophyta
1339 Chaetognatha
1248 Onychophora
952 Lycopodiophyta
711 Gastrotricha
640 Sipuncula
573 Ciliophora
393 Kinorhyncha
370 Nematomorpha
276 Chytridiomycota
273 Cycliophora
223 Myxomycota
202 Brachiopoda
153 Ctenophora
149 Hemichordata
104 Priapulida
102 Phoronida
61 Chlorarachniophyta
48 Rhombozoa
21 Entoprocta
16 Xenacoelomorpha
16 Gnathostomulida
12 Placozoa