smsk: A Snakemake skeleton to jumpstart projects

1. Description

This is a small skeleton to create Snakemake workflows. Snakemake "is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in python style."

The idea is to create a workflow with of snakefiles, resolve dependencies with conda, pip, tarballs, and if there is no other option, docker.

2. First steps

Follow the contents of the .travis.yml file:

Install (ana|mini)conda
- Anaconda
- miniconda

Installation

git clone https://github.com/jlanga/smsk.git smsk
cd smsk
conda --use-conda --create-envs-only

Execute the test pipeline:
```
snakemake --use-conda -j
```
Modify the following files:
- samples.yml with info on your samples,
- cluster.yml (optional) with run time and memory of the intensive tasks.
Run the pipeline with your data
```
snakemake --use-conda -j
```
(Optional) Run the pipeline inside a Docker container:
```
bash src/run_docker.sh
```

3. File organization

The hierarchy of the folder is the one described in Good enough practices in scientific computing:

smsk
├── cluster.yml: info on memory and running time of certain jobs.
├── data: raw data or links to backup data.
├── doc: documentation.
├── reports: reports generated by this example.
├── README.md
├── results: processed data.
├── Snakefile: driver script of the project. Mostly links to src/snakefiles.
└── src: project's source code, snakefiles tarballs, scripts for the lazy.

4. Writting workflow considerations

The workflow should be written in the main Snakefile and all the subworkflows in src/snakefiles.
Split into different snakefiles as much as possible. This way code supervision is more bearable and you can recycle them for other projects.
Start each rule name with the name of the subworkflow (map): map_bowtie, map_sort, map_index.
Use a Snakefile to store all the folder names instead of typing them explicitelly (bin/snakefiles/folders.py), and using variables with the convention SUBWORKFLOW_NAME: map_bwa_sample, map_sort_sample, etc.
End a workflow with a checkpoint rule: a rule that takes as input the result of the workflow (map). Use the subworkflow name as a folder name to store its results: map results go into results/map/.
Log everything. Store it next to the results: rule/name_wildcards.log. Store also benchmarks. Consider even creating a subfolder if the total number of files is too high.
End it also with a clean rule that deletes everything of the workflow (clean_map).
Use the bin/snakefiles/raw to get/link your raw data, databases, etcetera. You should be careful when cleaning this folder.
Configuration for software, samples, etcetera, should be written in config files (instead of hardcoding them somewhere in a 1000 line script). Command line software usually comes with mandatory parameters and optional ones. Ideally, write the mandatory ones in each snakefile and the optional in the config files.
In the same way that in Bioconductor datasets are split into three different tables (experimental, sample and feature data), we should do the same:
- sample.yml: information about your samples: library names, paths to fastq files, type of sequencing, ...
- features.yml: information about the reference genome, transcriptome, annotation, ...
- params.yml: non-default parameters of each one of the programs we are using: number of maximum threads, compression levels of samtools, quality thresholds for trimmomatic, ...
- cluster.yml: information about how to execute in the cluster environment each of the rules: how many threads and how much memory.
shell.prefix("set -euo pipefail;") in the first line of the Snakefile makes the entire workflow to stop in case of even a warning or a exit error different of 0. Maybe not necessary anymore (2016/12/13).
If compressing, use pigz, pbzip2 or pxz instead of gzip. Get them from conda.
Install as many possible packages from conda and pip instead of using apt/apt-get: software is more recent, and you don't have to unzip tarballs or rely on your sysadmin. This way your workflow is more reproducible. The problem I see with brew is that you cannot specify an exact version.

To install software from tarballs, download them into src/ and copy binaries to bin/ (and write the steps in bin/install/from_tarball.sh):

# Binaries are already compiled
wget \
    --continue \
    --output-document src/bin1.tar.gz \
    http://bin1.com/bin1.tar.gz
tar xvf src/bin1.tar.gz
cp src/bin1/bin1 bin/ # or link

# Tarball contains the source
wget \
    --continue \
    --output-document src/bin2.tar.gz \
    http://bin2.com/bin2.tar.gz
tar xvf src/bin2.tar.gz
pushd src/bin2/
make -j
cp build/bin2 ../../bin/

Use as much as possible temp() and protected() so you save space and also protect yourself from deleting everything.
Pipe and compress as much as possible. Make use of the process substitution feature in bash: cmd <(gzip -dc fa.gz) and cmd >(gzip -9 > file.gz). The problem is that it is hard to estimate the CPU usage of each step of the workflow.
End each subworkflow with a report for your own sanity. Or write the rules in bin/snakefiles/report. Make use of all the tools that multiqc supports.
Use in command line applications long flags (wget --continue $URL): this way it is more readable. The computer does not care and is not going to work slower.
If software installation is too complex, consider pulling a docker image.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.github/workflows		.github/workflows
config		config
doc		doc
profile/default		profile/default
workflow		workflow
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

smsk: A Snakemake skeleton to jumpstart projects

1. Description

2. First steps

3. File organization

4. Writting workflow considerations

Bibliography

About

Releases

Packages

Languages

License

jlanga/smsk

Folders and files

Latest commit

History

Repository files navigation

smsk: A Snakemake skeleton to jumpstart projects

1. Description

2. First steps

3. File organization

4. Writting workflow considerations

Bibliography

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages