smsk: A Snakemake skeleton to jumpstart projects

1. Description

This is a small skeleton to create Snakemake workflows. Snakemake "is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in python style."

The idea is to build a workflow out of snakefiles and to resolve dependencies with conda, pip, tarballs and, if there is no other option, docker.

2. First steps

Follow the contents of the .travis.yml file:

Install (ana|mini)conda

Installation

```
git clone https://github.com/jlanga/smsk.git smsk
cd smsk
bash bin/install/conda_env.sh  # Download packages and create an environment
```

Activate the environment (source deactivate to deactivate):
```
source activate smsk
```
Execute the pipeline:
```
snakemake -j
```

3. File organization

The hierarchy of the folder is the one described in A Quick Guide to Organizing Computational Biology Projects:

smsk
├── bin: your binaries, scripts, installation and virtualenv related files.
├── data: raw data, hopefully links to backup data.
├── README.md
├── results: processed data.
└── src: additional source code, tarballs, etc.
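
If you prefer to start from an empty folder instead of cloning, the layout above can be recreated by hand. This is only a sketch: the `project` variable and the temporary directory are assumptions made for the demo, and the `bin/install` and `bin/snakefiles` subfolders are taken from the paths mentioned elsewhere in this README.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical project root; a throwaway temp directory keeps the demo harmless.
project="$(mktemp -d)/smsk"

# One folder per role described in the tree above.
mkdir -p \
    "$project/bin/install" \
    "$project/bin/snakefiles" \
    "$project/data" \
    "$project/results" \
    "$project/src"

# Empty placeholders for the top-level files of the skeleton.
touch "$project/README.md" "$project/Snakefile" "$project/config.yaml"

ls "$project"
```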

4. Writing workflow considerations

The workflow should be written in the main Snakefile and all the subworkflows in bin/snakefiles.
Split into different snakefiles as much as possible. This makes the code easier to review, and you can reuse the snakefiles in other projects.
Start each rule name with the name of the subworkflow (map), and mark that it is executed over an item (_sample): map_bowtie_sample, map_sort_sample, map_index_sample.
Use a snakefile to store all the folder names instead of typing them explicitly (bin/snakefiles/folders), and name the variables with the convention SUBWORKFLOW_NAME: map_bwa_sample, map_sort_sample, etc.
End a workflow with a checkpoint rule: a rule that takes as input the result of the workflow (map). Use the subworkflow name as a folder name to store its results: map results go into results/map/.
Log everything. Store it next to the results: rule/rest_of_rule_name_sample.log. Store also benchmarks in JSON format. Consider even creating a subfolder if the total number of files is too high.
Also end it with a clean rule that deletes everything the workflow produced (clean_map).
Use the bin/snakefiles/raw to get/link your raw data, databases, etcetera. You should be careful when cleaning this folder.
Configuration for software, samples, etc. should be written in config.yaml (instead of hardcoding it somewhere in a 1000-line script). Command-line software usually comes with mandatory and optional parameters. Ideally, write the mandatory ones in each snakefile and the optional ones in config.yaml.
shell.prefix("set -euo pipefail;") in the first line of the Snakefile makes the entire workflow stop as soon as a command exits with a non-zero status, any stage of a pipe fails, or an unset variable is used. It may no longer be necessary (2016/12/13).
When compressing, use the parallel tools pigz, pbzip2 or pxz instead of gzip, bzip2 or xz. Get them from conda.
Install as many packages as possible from conda and pip instead of using apt/apt-get: the software is more recent, you don't have to unpack tarballs or rely on your sysadmin, and your workflow is more reproducible. The problem I see with brew is that you cannot specify exact versions.
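
To see what each option in "set -euo pipefail" changes, here is a small sketch that runs throwaway bash subshells and records their exit status. The variable names (status_default, status_pipefail and so on) are made up for this demo; nothing here is part of smsk or Snakemake.

```shell
#!/usr/bin/env bash

# Default behaviour: a pipe reports only the exit status of its last command.
status_default=0
bash -c 'false | cat' || status_default=$?

# pipefail: a failure in any stage of the pipe is propagated.
status_pipefail=0
bash -c 'set -o pipefail; false | cat' || status_pipefail=$?

# errexit (-e): execution stops at the first failing command,
# so the echo below is never reached and nothing is captured.
status_errexit=0
output_errexit="$(bash -c 'set -e; false; echo unreachable')" || status_errexit=$?

# nounset (-u): expanding a variable that was never set is an error.
status_nounset=0
bash -c 'set -u; echo "$undefined_variable"' 2>/dev/null || status_nounset=$?

echo "default=$status_default pipefail=$status_pipefail errexit=$status_errexit nounset=$status_nounset"
```

Only the first case ends with status 0. Note that a tool that merely prints a warning but still exits with 0 does not stop the workflow.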

To install software from tarballs, download them into src/ and copy them to bin/ (and write the steps in bin/install/from_tarball.sh):

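```
# Binaries are already compiled
wget \
    --continue \
    --output-document src/bin1.tar.gz \
    http://bin1.com/bin1.tar.gz
tar --extract --verbose --file src/bin1.tar.gz --directory src/  # unpack into src/, not the current folder
cp src/bin1/bin1 bin/  # or link

# Tarball contains the source
wget \
    --continue \
    --output-document src/bin2.tar.gz \
    http://bin2.com/bin2.tar.gz
tar --extract --verbose --file src/bin2.tar.gz --directory src/
pushd src/bin2/
make -j  # no limit on parallel jobs; cap it (e.g. make -j 4) on shared machines
cp build/bin2 ../../bin/
popd  # return to the project root
```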

Use temp() and protected() as much as possible, so you save disk space and protect important results from accidental deletion.
Pipe and compress as much as possible. Make use of the process substitution feature in bash: cmd <(gzip -dc fa.gz) and cmd >(gzip -9 > file.gz). The drawback is that it becomes hard to estimate the CPU usage of each step of the workflow.
End each subworkflow with a report for your own sanity, or write the report rules in bin/snakefiles/report.
Use long flags in command-line applications (wget --continue $URL): they are more readable, and the computer does not care and will not run any slower.
If software installation is too complex, consider pulling a docker image.
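
The piping and process substitution advice above can be tried on a toy file. This is a minimal sketch, assuming bash and gzip are available; the two-record FASTA content and the names workdir, n_records and n_headers are invented for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical toy FASTA file, compressed on the fly through a pipe.
printf '>seq1\nACGT\n>seq2\nGGCC\n' | gzip -9 > "$workdir/fa.gz"

# Input side: grep reads the decompressed stream as if it were a file,
# so no uncompressed copy is ever written to disk.
n_records="$(grep -c '^>' <(gzip -dc "$workdir/fa.gz"))"

# Output side with a plain pipe: keep the headers and compress them directly.
gzip -dc "$workdir/fa.gz" | grep '^>' | gzip -9 > "$workdir/headers.txt.gz"
n_headers="$(gzip -dc "$workdir/headers.txt.gz" | wc -l | tr -d ' ')"

echo "records=$n_records headers=$n_headers"
```

The output side uses a pipe rather than cmd >(gzip -9 > file.gz) because bash does not wait for the process inside >( ) to finish before moving on, so reading the file straight away in the same script could see incomplete data.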

5. Considerations when installing software

As a rule of thumb, install packages with conda first, use pip for whatever conda does not provide, and only then download binary tarballs into src/ and copy them to bin/, or download the source tarball and compile it. Example:

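```
conda install \
    samtools

pip install \
    snakemake

# Precompiled binary
wget \
    --continue \
    --output-document src/bin1.tar.gz \
    http://bin1.com/bin1.tar.gz
tar --extract --verbose --file src/bin1.tar.gz --directory src/  # unpack into src/
cp src/bin1/bin1 bin/  # or link

# Source tarball
wget \
    --continue \
    --output-document src/bin2.tar.gz \
    http://bin2.com/bin2.tar.gz
tar --extract --verbose --file src/bin2.tar.gz --directory src/
pushd src/bin2/
make -j
cp build/bin2 ../../bin/
popd  # return to the project root
```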
