The number of reads varied greatly between samples, both for ITS (fungi, Fig. 1) and 16S (bacteria, Fig. 2). This is likely to induce biases against rare taxa in the samples with fewer reads.
FIGURE TA-1. MultiQC plot of the number of ITS forward reads per sample (fungi)
FIGURE TA-2. MultiQC plot of the number of 16S forward reads per sample (bacteria)
The quality of ITS reads (fungi) was below the 1% error rate (Phred score = 20) towards the last ca. 50 bp of the sequences (Fig. 3). This was especially relevant for reverse (R2) reads and is likely to preclude the merging of R1 and R2 reads.
FIGURE TA-3. MultiQC plot of the quality score (Phred) per base call along ITS reverse reads (fungi)
The quality of 16S reads (bacteria) was below the 1% error rate for up to 2/3 of the read lenght for a proportion of the data (Fig. 4). This is likely to induce large data loss in order to obtain robust diversity assumptions.
FIGURE TA-4. MultiQC plot of the quality score (Phred) per base call along 16S reverse reads (bacteria)
*Phred score = 20: likelihood of finding 1 incorrect base call among 100 bases.
Between 13-58% of raw reads per sample were filtered out using Trimmomatic
due to their low accuracy below the 1% error rate. Additionally, the ITS amplicon that was generated with primers ITS1F and ITS4 has an average size >600 bp, so most R1 and R2 reads could not be merged accurately (up to >90% of reads per sample). We therefore analysed the ITS1 region only, and used both R1 and merged reads in amptk
. To account for the large variation in read lenght within ITS1, we set the read lenght to 350 bp and "padded" shorter reads with unknown bp (Ns).
Between 5-61% of raw reads per sample were filtered out using Trimmomatic
due to their low accuracy below the 1% error rate. The 16S primers 27F/519R generated an amplicon of 450-490 bp per sample after merging the reads. We set the read lenght to 500 bp (without "padding" shorter reads) and used only merged reads to improve data quality in amptk
. We avoided the recovery of unmerged R1 reads (as for ITS above) because it generated a enormous amount of dubious sequences, including in the negative control samples.
Two approaches are commonly used to identifiy microbial diversity from sequence data: 1) amplicon sequence variants (ASV) represent unique sequences, and 2) operational taxonomic units (OTU) result from clustering sequences at a similarity threshold, usually 97%. The ASV approach has been used extensively to study bacterial communities but are expected to perform poorly for fungal groups that commonly exhibit multiple variants of rRNA gene and ITS copies per genome (Tedersoo et al. 2022). ASV approaches therefore tend to overestimate richness of common fungal species (due to haplotype variation) but underestimate richness of rare species (by removing rare variants). Because community composition is driven by abundant taxa, the results are similar betwen ASV and OTU approaches.
19,030,426 reads passed quality filtering and clustered into 17,991 OTUs. 2296 OTUs did not match Fungi, giving a total of 15,695 fungal OTUs generated with amptk.
The mock community samples had 78 to >200 OTUs instead of the expected 10, indicating a high index bleed in both runs. We therefore filtered out low abundance OTUs (<0.5% of total read count per sample) in R to obtain the adequate number of OTUs in the mock community samples. Our approach emphasizes how index bleed can inflate the overall diversity of a dataset by more than x10. It is therefore important to filter out low abundance OTUs based on the OTU count per sample, using mock community controls to parametrize these filters.
After filtering for index bleed and contaminants reads from negative control samples using R, as well as removing OTUs from other samples that were included in the runs, we obtained a final OTU table with 2794 OTUs. Out of these, 1570 OTUs (56%) were present in only one sample and species accumulation curves were far from being saturated (Fig. 5). This indicates that our dataset only detected a small portion of the fungal diversity occuring in Victorian soils.
FIGURE 5. Species accumulation curves representing the number of fungal OTUs recovered with increasing sample effort for the overall diversity and each functional guild
Only 232,938 reads passed quality filtering and clustered into 1,040 bacterial OTUs. After filtering for index bleed and contaminants reads from negative control samples using R, as well as removing OTUs from other samples that were included in the runs, we obtained a final OTU table with 838 OTUs. Out of these, 151 OTUs (18%) were present in only one sample and species accumulation curves were approaching saturation (Fig. 6). This indicate that our sampling captured a good proportion of the bacterial diversity occuring in Victorian soils.
FIGURE 6. Species accumulation curves representing the number of bacterial OTUs recovered with increasing sample effort
We assigned the functional guild of fungal OTUs based on their taxonomy according to the FungalTraits database (Polme et al. 2020). The final table with the taxonomy and corresponding guild of each OTU can be found here.
Out of 2794 OTUs, 566 taxa corresponded to ectomycorrhizal species, 63 to arbuscular mycorrhizal species, 621 to saprotrophic species, 105 to plant pathogens and 46 to parasitic species. The remaining 1393 OTUs (50%) did not have sufficent taxonomic identification to be assigned a guild. This enlightens the need for more specimen-based research in mycology in order to accurately link DNA sequences to species.
-
In order to accurately merge ITS reads, it is important to generate and sequence ITS1 and ITS2 amplicons separately so that their lenght corresponds to the read size of the sequencing platform (300 bp for Illumina MiSeq 2x300).
-
Bench work needs to be optimized (equimolar concentrations between samples, adjusting PhiX spike-in) to improve the general quality of raw sequences
-
We recommend to check the quality of each run beforehand and apply extra quality filters (for example using Trimmomatic) to filter out reads below the 1% error rate before denoising/clustering. This greatly reduces the risk of mistaking sequencing error for genetic diversity.
-
Do not allow any error when reading the barcode during demultiplexing (the Australian Microbiome Initative currently allows two barcode mismatches).
-
Index bleed can inflate the overall diversity of a dataset by more than x10. We therefore recommand to filter out low abundance OTUs based on the OTU count per sample (instead of overall count) and parametrize these filters based on mock community samples.
- Despite the large sample size, 56% of fungal OTUs were detected in only one sample. Species accumulation curves indicate that only a small portion of fungal diversity was detected and sampling effort needs to be extended over more sites and replicates per sites (for example at different time points throughout the year) to capture the diversity of soil fungi in Victoria.
- 50% of fungal OTUs could not be assigned to a functional guild due to insufficent taxonomic identification. This emphazises the gap of knowledge about soil fungi in Victoria and the need to conduct more specimen-based research, integrating new species descriptions and DNA sequencing, to aid the identification of soil microorganisms and functions from eDNA samples. Such knowledge is critical for eDNA studies and soil health monitoring that depend on databases with accurate sequence identification (Truong et al. 2017).