Example: 2x300, 2 loci
Wiki in progress 7/8-15/16
Here is an example analysis from start to finish. A colleague agreed to let me post the analysis I am doing for him as I work through it here, so it represents real data we hope to one day publish. This analysis encompasses 45 samples from a total of 192 on the run. The other samples belong to other users who requested the same loci and sequencing mode, but they are not of interest to us here. Every sample on the run was sequenced for both 16S V4 (515F-806R) and ITS2 (5.8S_Fun-ITS4_Fun). Sequencing was carried out on an Illumina MiSeq running in 2x300 mode. 2x300 sequencing uses MiSeq v3 chemistry, whereas 2x250 and 2x150 runs use v2 chemistry. v3 runs allow imaging of more of the flowcell surface, which yields more reads; this is useful when you have many samples or are doubling up your loci. However, 300-base reads are longer than the 16S V4 product (~253 bp primer to primer), so we will want to take care to remove 3' primer sequences and use reads that are appropriately sized. This run used separate sequencing primers specific to each locus, so the start of each read is just beyond the 5' primer. As well, this run used single indexing, so we may encounter higher levels of PhiX contamination and/or sample cross-talk.
Check raw files -- map files and fastqs
Here is a screenshot of my Nautilus (File browser in Ubuntu) window listing the contents of my raw_data directory:
Mapping files
You can see the essentials are present: both main fastq reads (read1.fastq, read2.fastq), the indexing fastq (index1.fastq), and a map file for each locus (map.16S.txt, map.ITS.txt). Those fastq reads are huge files (13.1 GB each). To start, I want to make sure my map files look alright. I will use the word count command (wc) with -l to list the number of lines (should be 45 samples plus a header line):
wc -l map.16S.txt
wc -l map.*
I first did this specifying the 16S map and got 46 lines in the file (the first line is the header). Then I used the glob "*" to read both files at the same time, and they are both the same length.
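The output from the glob version should look something like this (exact spacing may vary by system; both maps report 45 samples plus a header):
  46 map.16S.txt
  46 map.ITS.txt
  92 total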
Next I will peek at the maps with the head command to make sure they are tab-delimited, as LibreOffice sometimes saves my map files as comma-separated files if I am not careful enough.
head map.16S.txt
head map.ITS.txt
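If you want a more explicit check than eyeballing the head output, you can count tab-separated fields in the header line; a file that was accidentally saved comma-separated will report a single field. A minimal check with awk (standard on Ubuntu):
awk -F'\t' 'NR == 1 {print NF " tab-delimited fields in header"}' map.16S.txt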
Fastqc to check raw qualities
Looks good, as did the other one (not shown here). Next I want to get a sense of the raw fastq data. To do this I will use fastqc. If you don't know fastqc, you should try it out. The syntax is very simple:
fastqc read1.fastq
fastqc read2.fastq
Fastqc reports progress in 5% increments. As I mentioned, these are large files, so I will do something else for a bit while they process (~15 minutes). To process the second read simultaneously, I opened a second terminal and started fastqc on read2 there:
When fastqc is done, it deposits a directory for each read as well as a .zip archive of the same result (handy if you need to email it to your PI or a collaborator). In this screenshot I have also expanded the read_fastqc directory to reveal the fastqc_report.html file within:
Open fastqc_report.html in a browser. There will be a separate file for each read that you should inspect. Fastqc is meant mainly for genomic sequencing data, whereas amplicon data has plenty of nuance that will cause fastqc to flag several portions of your reads for quality issues. This is normal; just pay attention to the overall quality here. Even though fastqc indicates there are quality issues with these reads, I am quite comfortable moving forward:
Read 1 quality
Read 2 quality
First notice that read 2 has lower quality than read 1. This is perfectly normal. Also notice the x-axis scale is not linear, so that precipitous drop in quality near the end actually spans only a small portion of the read. Keep in mind this is just a starting place; at this point we just want to know the overall quality of the run. My sense is these data are pretty nice and I will be able to use most of the reads. Some of the 3' low quality comes from reads that sequence through the short 16S product (~50% of the run), so primer removal will improve upon the quality seen here. Read joining will further "rescue" some of the lower-quality bases by alignment consensus.
Primer removal
Use the akutils primer_file command to get the reverse-complemented primer sequences from the akutils primer database and start the removal of primers from the 16S component:
akutils primer_file
When I have done this for both primers, I get the following output:
The order of primers in the file is important. Since we are doing directional primer removal (removing the reverse-complemented primer sequence used to produce read 1 from the tail of read 2, and vice versa), the primer name that corresponds to the first read must come first in the primer file. In this case, 515F was used to produce the first read, and 806R was used to produce the second read.
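To make the ordering concrete, here is what a primer_file.txt for this locus might contain. These are the standard 515F/806R primer sequences reverse-complemented; the exact names, sequences, and formatting in your akutils primer database may differ, so treat this as an illustration only:
515Frc    TTACCGCGGCKGCTGGCAC
806Rrc    ATTAGAWACCCBDGTAGTCC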
Now I can issue the command to remove 16S primers. Remember to issue the command with no arguments first to bring up the strip_primers usage:
akutils strip_primers
akutils strip_primers 13 read1.fastq read2.fastq index1.fastq
This will take a while, but since the primers have already been read into the command, I can delete the existing primer file and start removing the ITS2 primers at the same time by building a new primer_file.txt:
rm primer_file.txt
akutils primer_file
Note the order of primers here. In our lab we sequence ITS2 in a "reverse" direction, reading from the LSU position first (ITS4_Fun) and from the 5.8S position second (5.8S_Fun).
Now remove ITS2 primers as well (exact same command as for 16S):
akutils strip_primers 13 read1.fastq read2.fastq index1.fastq
The end of the output gives the runtime for each data set. This is often important to know if you are requesting resources on a computing cluster (such as Monsoon at NAU). Here we see that both 16S and ITS ran for about 46 minutes:
16S:
ITS:
The times are about equivalent, reassuring me that the loci are distributed proportionally across the run (~50% 16S or ITS). I am very happy with this result.
You might notice the "9259 empty fastq records found" in the 16S output and a similar message for ITS (only 1667 records). This is important! Primer removal by many methods can remove a sequence entirely from a fastq record without removing the lines associated with that read from the file. This leaves the fastq header (normal), then a blank line (where the sequence should be), then a line with just a "+" sign (normal), and finally another blank line where your qual scores should be. This is OK for some applications, but in my experience it is never OK for QIIME-based applications. There are several threads on the QIIME forum discussing such problems with fastq data and how to deal with them. I finally wrote a filter into the strip_primers script when I realized why I was having such problems. It is unclear why this happens; my guess is some kind of chimeric product, or alternatively incomplete cluster turnaround during sequencing. I would assume the roughly fivefold excess of empty records in the 16S product has to do with the sizes of the PCR products (~253 bp for V4, 290 bp and larger for ITS2) relative to the length of the reads (301 bp for this run).
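If you ever want to verify such a count yourself, empty sequence lines are easy to tally with awk, since the sequence sits on every fourth line starting at line 2 of a fastq:
awk 'NR % 4 == 2 && length($0) == 0 {n++} END {print n+0}' read1.noprimers.fastq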
Check results of primer removal by generating length histograms of fastqs
You just saw that some fastq records are left empty following primer removal. This means a wide range of sequence lengths likely remains, so it is useful to inspect the distribution of read lengths following primer removal next.
First, list the current contents of the raw_data directory:
ls -l
This shows the new directories where the strip_primers output resides. Navigate into these directories (I have two terminal windows open that I am working from, one for each locus):
cd strip_primers_out_515Frc-806Rrc_3prime/
cd strip_primers_out_ITS4_Funrc-5.8S_Funrc_3prime/
And list the contents of those directories (only showing 16S, but you should do it for both):
ls -l
You can see a log file there, which you should inspect. It has many lines because it contains the output from the cutadapt command, which provides quite a bit of detail about what occurred during trimming. There is also a small amount of detail from the akutils strip_primers command itself, including the primers used and the runtime.
Use cat to view the log file (adjust command to your own file name):
cat log_strip_primers_20160709_0258PM.txt
Here is the top of the cutadapt-specific output for the 16S:
This summary tells you that around 7 million reads of the ~20 million total contained 16S sequence and were properly trimmed. Keep in mind that we expect all of the properly produced 16S products to be trimmed, while only some of the ITS sequences will be.
Scroll down a bit and you will see a warning that your adapter sequence may be incomplete. This is normal for amplicon data, whose primers anneal to evolutionarily conserved regions of the ribosomal genes. You can ignore this warning.
Scroll down some more and you will get into the lengths of trimmed sequence from each read. Since the 16S product should be about 253 bp, and the sequencing run was 2x300 bp (2x301 technically), we should expect a peak at 48 bp (301 - 253) for the length trimmed here.
And that is exactly what we see.
Scroll down further and you will see the 9259 reads where all 301 bases were removed, generating those empty records.
Now let's make some length histograms to visualize the entire file rather than just the reads trimmed by cutadapt. In each strip_primers_out directory, I run the following commands:
fastq_length_histogram.sh read1.noprimers.fastq
fastq_length_histogram.sh read2.noprimers.fastq
These commands result in the following output:
The output file has some leading spaces to make it easier to read on the command line. The first column is the count for the read length listed in the second column. For clarity here I want to make a plot in LibreOffice, so I will use sed with a regular expression to strip those leading spaces so the file parses properly during import to the spreadsheet:
sed -i "s/^\s\+//g" histogram.read1.noprimers.fastq.txt
I can now open the file in LibreOffice and construct a column chart.
This shows a peak at 253 (right where I expect) for 16S sequences. There is another peak at 301 where nothing was trimmed. Even though we know a range of sizes remains, the majority fall to the baseline and are not visible here.
Using cat to view the histogram file (only showing read1 here):
cat histogram.read1.noprimers.fastq.txt
You can easily find the 253bp peak in the 16S data:
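For example, you can pull out just the rows around the expected peak (column 2 is the read length):
awk '$2 >= 250 && $2 <= 256' histogram.read1.noprimers.fastq.txt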
The ITS data are less affected by trimming because we expect the product sizes to be at least about 300 bases (more like 320 or 330 with these primers). I just need to know that I have made an effort to remove the primers from the ITS data before moving on.
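As an aside, fastq_length_histogram.sh ships with akutils; if you find yourself on a machine without it, a roughly equivalent awk one-liner (count in column 1, length in column 2, sorted by length) is:
awk 'NR % 4 == 2 {count[length($0)]++} END {for (len in count) printf "%7d %d\n", count[len], len}' read1.noprimers.fastq | sort -k2,2n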
Remove PhiX contamination:
This step removes PhiX from your data, quantifies the amount that demultiplexes with your samples, and reduces your raw data to only what is represented in your mapping files (sample-specific data only). This shrinks your data files so that subsequent steps such as length filtering won't take up too much RAM.
I am going to navigate into the raw_data directory and build new output directories for each mapping file. When ready to remove the PhiX, the first step is to call the usage:
akutils phix_filtering
Then issue the command:
akutils phix_filtering phix_filtered_16S map.16S.txt strip_primers_out_515Frc-806Rrc_3prime/index1.noprimers.fastq strip_primers_out_515Frc-806Rrc_3prime/read1.noprimers.fastq strip_primers_out_515Frc-806Rrc_3prime/read2.noprimers.fastq
Output:
Repeat for ITS data:
akutils phix_filtering phix_filtered_ITS map.ITS.txt strip_primers_out_ITS4_Funrc-5.8S_Funrc_3prime/index1.noprimers.fastq strip_primers_out_ITS4_Funrc-5.8S_Funrc_3prime/read1.noprimers.fastq strip_primers_out_ITS4_Funrc-5.8S_Funrc_3prime/read2.noprimers.fastq
Output:
Everything finished without issue. Note the percent contamination is similar between loci. The ITS took longer to process: once demultiplexed, it has more sequence per read remaining than the 16S, and read mapping takes time. This result is good and I will continue.
Read joining:
Some people are very particular about their read joining. There are several options available, but I have been using fastq-join for several years now and continue to be happy with it. You need to be careful, but not too careful. Fastq-join has two major options available for adjustment: percent mismatch (-p) and minimum overlap (-m) (see the fastq-join wiki page). I want an adequate overlap before allowing pairs to merge (30 bases is my arbitrary standard), and I allow up to 30% mismatch. This sounds high, but trust me, it works. I ran a bunch of benchmarking and found that overly stringent merging yielded results similar to overly lax settings (-p at 0 or 1 gives output like -p at 60-99), while anything from -p 5 to -p 45 gave about the same result. -p 30 allows most of your data to merge without incurring serious mismatch issues, which would manifest as sequences slightly longer than anticipated.
Here is a slide I made to help convince people I know what I am talking about with regard to percent mismatch and fastq-join:
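For reference, the akutils join_paired_reads wrapper passes -m and -p through to fastq-join. If you wanted to run fastq-join directly on a pair of files, the call would look roughly like this (the % in the output template expands to join, un1, and un2):
fastq-join -m 30 -p 30 read1.phixfiltered.fastq read2.phixfiltered.fastq -o joined.%.fastq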
So navigate into the phix_filtered output directories and call the usage:
cd phix_filtered_16S
cd phix_filtered_ITS
akutils join_paired_reads
And issue the command (the same for either locus). My indexes are 12-base Golay codes:
akutils join_paired_reads index.phixfiltered.fastq read1.phixfiltered.fastq read2.phixfiltered.fastq 12 -m 30 -p 30
Output (16S):
(ITS):
We want to know how well the read merging went. To do this we examine the log, in particular the output from fastq-join reporting the join success rate and read-length standard deviations.
First, I cat the log file (adjust to your own file name):
cat join_paired_reads_out/log_join_paired_reads_20160713_0140PM.txt
And find the relevant lines (16S):
ITS result:
Examine join results with length histogram
Navigate into your join_paired_reads_out directories (within the phix_filtered out for each locus) and produce a length histogram for the resulting read file (rd.fq):
cd join_paired_reads_out
fastq_length_histogram.sh rd.fq
Now use cat to print it out in a terminal, or use the sed command from before to allow import into a spreadsheet for graphing. Graphing is more useful for ITS, which is size polymorphic, than for 16S. Here are plots for comparison.
16S text output:
16S (only showing 200-300 bases):
ITS (only showing 290-475 bases):
The 16S peak at 253 bases is obvious, with a little drift yielding a few hundred thousand reads at 251, 252, and 254 bases. The ITS data has substantial peaks ranging from about 300 to 466 bases (from examining the text output). We will exclude all other sizes by filtering the fastq output by length.
16S:
filter_fastq_by_length.sh 3 251 254 rd.fq idx.fq
ITS:
filter_fastq_by_length.sh 3 300 466 rd.fq idx.fq
The last thing to do is to generate fastqc plots for the size-filtered output and trim the reads to a uniform length for processing.
fastqc rd.251-254.fq
fastqc rd.300-466.fq
16S:
The 16S results look great, and the remaining variation in size (251-254) is within 5%, so there is no need for any more trimming.
ITS:
The shortest ITS reads are 300 bases, though the quality holds above q20 out to about 450 bp (scroll up to see the quality improvement realized by read joining and other pre-processing steps over the raw data). Nonetheless, we need to trim these reads uniformly to 300 bases before continuing. For this we will use fastq-mcf from ea-utils:
fastq-mcf -0 -l 300 -L 300 n/a rd.300-466.fq -o rd.300.fq
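For reference, here is my reading of those fastq-mcf arguments; double-check them against fastq-mcf -h for your ea-utils version:
# -0      : zero out the default quality-based filters (no quality trimming here)
# -l 300  : discard any read left shorter than 300 bases
# -L 300  : truncate reads longer than 300 bases down to 300
# n/a     : placeholder where an adapter file would normally be supplied
fastq-mcf -0 -l 300 -L 300 n/a rd.300-466.fq -o rd.300.fq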
This output indicates all reads were retained, and we can move ahead with analysis.
Pick OTUs workflow (akutils pick_otus)
These data are almost ready for processing. Navigate back up to the top of your directory tree (above raw_data) and make new directories in which to process your data:
mkdir 16S_processing
mkdir ITS_processing
Copy your appropriate map file into each processing directory along with the joined, size-filtered read and index files. Rename your read and index files in the process to rd.fq and idx.fq (naming convention for akutils pick_otus workflow):
cp raw_data/map.16S.txt ./16S_processing/
cp raw_data/map.ITS.txt ./ITS_processing/
cp raw_data/phix_filtered_16S/join_paired_reads_out/rd.251-254.fq ./16S_processing/rd.fq
cp raw_data/phix_filtered_16S/join_paired_reads_out/idx.251-254.fq ./16S_processing/idx.fq
cp raw_data/phix_filtered_ITS/join_paired_reads_out/rd.300.fq ./ITS_processing/rd.fq
cp raw_data/phix_filtered_ITS/join_paired_reads_out/idx.300-466.fq ./ITS_processing/idx.fq
These directories now contain three files each:
Before processing, I want to have an appropriately set up akutils config file in each directory, and I will be using a parameter file to control the distance threshold for Swarm. I also want to do the split_libraries_fastq.py step manually for the ITS data rather than allow the pick_otus script to handle it. The reason is the sequencing orientation (LSU to 5.8S direction) relative to the database files (UNITE) and the HMMER profiles used by ITSx.
I will get this started before doing my akutils config and parameter files.
split_libraries_fastq.py -i rd.fq -b idx.fq -o split_libraries -m map.ITS.txt -q 19 -p 0.95 -r 0
This produces output in a directory called "split_libraries" (important for proper akutils function). With these settings, retained reads must contain no bases below q20 after quality truncation, and any truncated read must be at least 95% of its original length to be retained. When the command finishes, I immediately want to examine the log:
cat split_libraries/split_library_log.txt
I am interested in the top of this, which tells how many sequences were input and how many were discarded for various reasons:
This output shows I am still losing over 50% of my reads to quality filtering (too short after truncation). I will run another fastqc on the rd.fq file to get a better idea of the quality across the reads:
fastqc rd.fq
The last four box-and-whisker positions show lower quality. I will trim these reads once more to 250 bases, rename the output, and rerun the split_libraries_fastq.py command:
fastq-mcf -0 -l 250 -L 250 n/a rd.fq -o rd.250.fq
mv rd.250.fq rd.fq
rm -r split_libraries/
split_libraries_fastq.py -i rd.fq -b idx.fq -o split_libraries -m map.ITS.txt -q 19 -p 0.95 -r 0
And the top of the log shows:
Removing those last 50 bases recovered only about 130,000 additional reads. I will still get adequate taxonomic depth with 250 bases, but I'd rather not remove more sequence information just to get my read counts up from here. You may find it useful to play around with your data in this way if you are losing a lot of reads to quality filtering and need to get your counts up for some reason.
Recall that the ITS data were sequenced in reverse orientation relative to the database files. I will use adjust_seq_orientation.py from QIIME to reverse-complement my demultiplexed reads. If I had done this prior to split_libraries_fastq.py, the quality filtering would have started at the ratty 3' end and I would have lost many more reads. I will use "-r" to avoid renaming the fasta headers, and then rename each file so that akutils will find the correct output when processing begins.
cd split_libraries/
adjust_seq_orientation.py -r -i seqs.fna
mv seqs.fna seqs_original_orientation.fna
mv seqs_rc.fna seqs.fna
When done, my ITS split_libraries directory looks like this:
Parameter file construction:
For this I use nano. I will use Swarm for OTU picking, but I wish to lower the distance threshold (resolution) to 4 (based on my own benchmarking). Open nano like this (from within a processing directory):
nano parameter_file.txt
Add the following text and save the file:
pick_otus:swarm_resolution 4
We want to process both 16S and ITS this way, so copy the parameter file to the ITS directory as well:
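Assuming the side-by-side directory layout created above, from within 16S_processing that copy is just:
cp parameter_file.txt ../ITS_processing/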
akutils config files:
These will be specific to each locus, mainly with regard to database files (Greengenes for 16S, UNITE for ITS). I will use Swarm for OTU picking and BLAST for taxonomy assignment. I will dereplicate sequences on the first 100 bases (again, based on my own benchmarking and manual assessment of differences among typical conspecific sequences).
My config files contain the following:
16S:
ITS:
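Before launching anything, it is worth confirming the settings each directory will use. akutils includes a print_config command for this; I believe it reports the local config when run from within a directory containing one:
akutils print_config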
When ready to process data, each of my processing directories appears like so:
Pick OTUs command:
Since I am working on a stand-alone workstation, I could enter commands one at a time, or limit processor consumption so they run simultaneously, but neither approach is efficient. If working on a cluster, the resource scheduler would take care of allocating enough CPU and RAM to your job. Since I used the akutils_ubuntu_installer, I have Task Spooler installed, which can handle a command queue. To simply issue the command to begin processing the 16S data, I would do the following:
akutils pick_otus 16S
To run this command with Task Spooler, I add some bits to the beginning of the command. The -m flag tells Task Spooler to email me when the job completes (or exits for whatever reason), and the -L flag labels my job with the string that follows. The akutils command then comes last:
ts -m -L MG_16S_otus akutils pick_otus 16S
The output tells me this is job 0. I can monitor progress with the -c flag, specifying the job number:
ts -c 0
From my ITS processing directory I can check on the status of the queue:
ts
You can see there is just one job running, and the queue is set to run one job at a time (run=1/1). I will now queue the ITS data:
ts -m -L MG_ITS_otus akutils pick_otus ITS
The queue now shows both jobs present, the 16S job (job 0) running, and the ITS job (job 1) waiting:
These jobs will probably run well into the evening. Task Spooler will let me know when they finish and will handle execution of the queue. I can come back to this in the morning.
Pick OTUs workflow output
In fact, these jobs ran fairly quickly. Here are screenshots of most of the output of the ts commands:
16S:
ts -c 0
ITS:
ts -c 1
In total, these two workflows ran only about 3.5 hours (16S for ~1 hour, ITS for ~2.5 hours). This speed is largely made possible by the dereplication (prefix/suffix) step. The ITS ran longer due to the ITSx step, which does not benefit from dereplication. The 16S had 47,600 OTUs to assign taxonomy to, and this consumed the bulk of the "Sequential OTU picking steps" (00h:35m:52s); ITS had just 10,481, and the same section ran only 00h:03m:34s. If you wanted, you could rerun the workflow using longer dereplication at 150 nt and 200 nt and see if you get about the same estimate of diversity. However, I contend that much of this newly discovered "diversity" would likely be sequencing error, and in the interest of making conservative estimates from environmental DNA sequencing, I would stick with the 100 nt dereplication for now unless obvious problems arise downstream. It is easy to reconfigure akutils to adjust such parameters if you need to.
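If you do want to test longer dereplication lengths, the change should be a one-line reconfigure. Note that the key name below is my guess at the relevant config field; verify the exact spelling against your own akutils print_config output first:
# "Prefix_length" is a hypothetical key name -- confirm the exact field in your config before using it
akutils configure Prefix_length 150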
We will move ahead in a minute, but first let's just check the biom table summaries for each data set.
16S:
cat swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.summary
ITS:
cat swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.summary
These counts look OK. I may dump a couple of the lower counts in the 16S data, but I will move on for now.
Alignment and phylogenetic tree construction
Now we will proceed with aligning the representative sequences and building a phylogenetic tree. For 16S we will use the "core_set_aligned" alignment template and the "lanemask_in_1s_and_0s" lanemask file from Greengenes in concert with the parallelized PyNAST command from QIIME. For ITS we will use Mafft. For both sets we will use the make_phylogeny.py command in QIIME with default settings (FastTree). The ITS phylogeny will NOT be valid, but I like to see the result, and it doesn't take much extra time anyway. Later we will run an analysis using Ghost-tree for a more defensible use of phylogenetic metrics with ITS data. To perform these steps, we can use the akutils align_and_tree workflow.
Adjust CPU consumption
I want both of these jobs to run simultaneously, so I will limit the number of CPUs (24 available) each workflow can consume with a direct adjustment to the config file for each data set.
akutils configure CPU_cores 12
This brings up a dialogue, which I accept, and then I repeat the adjustment for the ITS config file.
Then I adjust Task Spooler to allow two jobs to run at once:
ts -S 2
akutils align_and_tree workflow
16S:
akutils align_and_tree 16S swarm_otus_d4/
With Task Spooler:
ts -m -L MG_align_16S akutils align_and_tree 16S swarm_otus_d4/
ITS:
akutils align_and_tree other swarm_otus_d4/
With Task Spooler:
ts -m -L MG_align_ITS akutils align_and_tree other swarm_otus_d4/
These complete relatively quickly.
16S:
ITS:
ITS ran longer than 16S even though 16S has over four times as many sequences to align. PyNAST alignment is fully parallelized, which speeds things up, but the 16S alignment is also much better quality, which limits the time subsequently needed for phylogeny construction.
Prepare ITS data for use with Ghost-tree phylogeny
Ghost-tree (see the Ghost-tree repository) is a nifty tool that compares taxonomy strings between the UNITE (ITS sequences) and SILVA (18S sequences) databases, extracts "representatives" from SILVA corresponding to your taxonomy assignments from UNITE, and builds a phylogeny you can then use with your ITS data. I built one and supplied it along with my formatted ITS2 UNITE sequences in my QIIME_databases repository. This file in the ITS2 directory (ghost_tree.nwk) will work with any data set for which taxonomy was assigned against the corresponding fasta and taxonomy files in the same directory. However, Ghost-tree limits your analysis to closed-reference OTUs only, and in my experience fungal ITS2 data do not work well with closed-reference OTU picking. To get around this, I wrote a script (preprocess_otus_for_ghost-tree.sh) that renames the OTU designations from de novo analyses (e.g., denovo1, denovo2, etc.) according to their taxonomic designation, just as a closed-reference analysis would output them. It really is just a simple search-and-replace function, and it makes all the difference. Before starting any diversity analyses, I will run this script to produce an alternative output. It is very important to note that there will be a loss in sequencing depth when running this script, because any OTUs that lack adequate taxonomic assignment to find a Ghost-tree match will be removed.
Call usage (as elsewhere, issuing the script name with no arguments brings it up):
preprocess_otus_for_ghost-tree.sh
Then execute against the .nwk file in the database directory, using the "005 table" from my output OTU tables directory:
This shows the alternative output .biom file to be used in conjunction with the alternative tree output. I want to see an OTU table summary, so I will summarize all .biom files in the OTU table directory with biom-summarize_folder.sh:
biom-summarize_folder.sh swarm_otus_d4/OTU_tables_blast_taxonomy/
And cat the output:
cat swarm_otus_d4/OTU_tables_blast_taxonomy/005_table_ghost-tree_filtered.summary
Recall the summary from the table we used as input to the "preprocess_for_ghost-tree" script:
cat swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.summary
You can see my lowest-count sample has decreased from 1273 to 577 reads, and the number of OTUs decreased from 451 to 383. This isn't too bad, and I note the table density increased slightly from 0.185 to 0.190, possibly indicating a "cleaned up" table that could yield better statistical power. On to diversity analyses.
Core diversity workflow
When I am ready to run diversity analyses, I start with the mapcats.sh command to list the categories present in the mapping file:
mapcats.sh map.16S.txt
I am interested in PlantSpecies, SoilType, State, and InvasiveStatus. I will now pass those categories to each akutils core_diversity command, allowing 6 CPU cores for each job (16S, ITS, and ITS + Ghost-tree = 18 cores total). The workflow will automatically find the trees from the align_and_tree step.
akutils core_diversity usage:
akutils core_diversity
16S:
akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.biom map.16S.txt PlantSpecies,SoilType,State,InvasiveStatus 6
To background everything, I will execute these commands with Task Spooler. I will set it to allow 3 simultaneous jobs and then execute everything:
ts -S 3
ts -m -L MG_16S_cdiv akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.biom map.16S.txt PlantSpecies,SoilType,State,InvasiveStatus 6
I want to see the initial output before I let everything run, so I get the output from Task Spooler (job 4):
ts -c 4
This depth is pretty low, so I'm going to stop this job. Unfortunately there isn't a great way to kill a running job in Task Spooler; I use htop, select the processes with the mouse, and kill them with "k". I also remove the output directory (multiple times). Eventually, I get an email from Task Spooler that the job is "done."
htop
rm -r swarm_otus_d4/OTU_tables_blast_taxonomy/core_diversity/
This time I will set the rarefaction depth to 5,000, which will remove three samples from the analysis.
akutils configure Rarefaction_depth 5000
Now I can re-execute the command as before:
ts -m -L MG_16S_cdiv akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.biom map.16S.txt PlantSpecies,SoilType,State,InvasiveStatus 6
View output with ts command:
ts -c 5
Better. On to ITS analysis.
ITS:
The ITS data didn't have a super low sample, so I will allow the automatic rarefaction depth. First I will execute on the 005_table.biom output, which should automatically locate the tree built from the Mafft alignment. Again, I am using Task Spooler in this example.
akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.biom map.ITS.txt PlantSpecies,SoilType,State,InvasiveStatus 6
With Task Spooler commands:
ts -m -L MG_ITS_cdiv akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table.biom map.ITS.txt PlantSpecies,SoilType,State,InvasiveStatus 6
And view the output with ts command (job 6).
ts -c 6
You can see it correctly found the tree file we generated. Since that variable has now been read into the script, I can modify the config file to point the Tree setting at the Ghost-tree file for the alternative output:
akutils configure Tree swarm_otus_d4/ghost-tree_output_005_table/005_table_ghost_tree.nwk
And now I can execute the core_diversity command using the ghost-tree phylogeny.
akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table_ghost-tree_filtered.biom map.ITS.txt PlantSpecies,SoilType,State,InvasiveStatus 6
With Task Spooler commands:
ts -m -L MG_ITS_cdiv_ghosttree akutils core_diversity swarm_otus_d4/OTU_tables_blast_taxonomy/005_table_ghost-tree_filtered.biom map.ITS.txt PlantSpecies,SoilType,State,InvasiveStatus 6
And view the output with ts command (job 7).
ts -c 7
The tree we specified has been read into the script, so now we wait until the results are complete.
Assessing results
Will continue wiki construction soon! (7/14/16)