- About the Project
- Datasets
- Built with
- Software
- Binaries
- File Structure
- Files
- Folders
- Usage
- Roadmap
- Versioning
- Authors
- Acknowledgments
This is the development repository for a pipeline created to perform frequency analysis on African genetic datasets. This work is licensed under a Creative Commons Attribution 4.0 International License.
This pipeline is designed to accept variant call format data in the form of .vcf files. Due to some of the bioinformatics software used internally, these files must be compressed using BG-zip compression and provided with an accompanying Tabix index file. Both of these pieces of software are provided by Samtools, a standard bioinformatics software suite. Together, the two files provide a block-level compressed copy of your data and a block index, allowing the software to decompress portions of your file and access specific entries without having to decompress the entire file.
This is also good practice in general and arguably should be a bioinformatics software standard.
Please be advised that BG-zip compression is not the same as ordinary gzip compression, such as that provided by the Linux gzip command. Although a BG-zip file is still valid gzip output and can be read by both programs, only BG-zip produces the block-level compression needed to create a Tabix index.
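To make the difference concrete, here is a minimal Python sketch that checks whether a file starts with a BG-zip (BGZF) block rather than a plain gzip stream. It relies on the BGZF layout described in the SAM/BAM specification, where each block is a gzip member carrying an extra subfield tagged `BC`. The function name `is_bgzf` is hypothetical and is not part of this pipeline.

```python
import struct


def is_bgzf(path):
    """Return True if the file at `path` starts with a BGZF (BG-zip) block.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Plain gzip and BGZF share the same magic bytes and DEFLATE method.
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        # BGZF blocks always set the FEXTRA flag (bit 2 of the flag byte).
        if not header[3] & 0x04:
            return False
        # XLEN: length of the extra field that follows the fixed header.
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for 'B', 'C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A file written by the standard gzip tool should return False here, which is a quick way to spot input that still needs to be recompressed before a Tabix index can be built.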
This pipeline has been built using a Python-based domain-specific language (DSL) called Snakemake and is coded to run in a PBS/Torque environment using the qsub command (this is set by the profile folder). As such, it needs to be run on a server with the appropriate binaries and batch scheduling software.
Below is a list of software used by this pipeline:
- PBS/Torque batch scheduler
- Snakemake
- PLINK-2.0
- VCF-Tools
- liftOverPlink (binaries contained within this repo. Update at own risk!)
- liftOver (required dependency for liftOverPlink)
- e! Ensembl VEP API
Below is a list of binary dependencies used in this pipeline.
- Reference Genomes (properly compressed with accompanying index and dictionary files)
  - GRCh38
  - Additional genomes as needed based on input data
This pipeline uses the standardised folder structure, where the workflow itself is located under the workflow folder.
.
├── config # All config data (PBS Profile, genes, etc)
├── resources # Commonly used resources (WARNING: TO BE DEPRECATED SOON)
├── results # The output of the pipeline
├── workflow # The entrypoint to the code of the pipeline
└── README.md
This project uses the following naming conventions:
All user-generated files should be named using underscore naming conventions: spaces are replaced with underscores and no capital letters are used.
E.g. this_is_a_test_example.txt
All Snakemake-generated files are labelled according to the <sample_name>.<file-extension> format and stored in a folder named according to the process that produced them.
E.g. intermediates/liftover/1000g.vcf
All user-generated folders should use camelCase naming conventions: the first word is lower-case, spaces are removed, and the initial letter of each following word is capitalised.
E.g. thisIsATestExample
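The naming conventions above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the regular expressions are one possible reading of the conventions (for instance, digits are allowed, which the text does not spell out), and the function names `is_snake_case_file` and `is_camel_case_folder` are hypothetical.

```python
import re

# Lower-case words joined by underscores, followed by optional extensions.
# Allowing digits is an assumption; the convention does not mention them.
SNAKE_CASE_FILE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*(?:\.[a-z0-9]+)*")

# A lower-case first word, then zero or more capitalised words, no separators.
CAMEL_CASE_FOLDER = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")


def is_snake_case_file(name):
    """Check a user-generated file name against the underscore convention."""
    return SNAKE_CASE_FILE.fullmatch(name) is not None


def is_camel_case_folder(name):
    """Check a user-generated folder name against the camelCase convention."""
    return CAMEL_CASE_FOLDER.fullmatch(name) is not None
```

Both examples given above (this_is_a_test_example.txt and thisIsATestExample) pass their respective checks, while names containing spaces or leading capitals do not.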
All Snakemake-generated folders use the following folder structure:
.
└── intermediates
    └── <ruleName>
        ├── <file_name>.<extension>
        ├── <file_name>.<extension>
        └── <file_name>.<extension>
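As an illustration of this layout, the sketch below assembles the expected location of an intermediate file from a rule name, a sample name and an extension. The helper `intermediate_path` is hypothetical and only mirrors the structure described above; it is not taken from the workflow code.

```python
from pathlib import PurePosixPath


def intermediate_path(rule_name, sample_name, extension):
    """Build intermediates/<ruleName>/<sample_name>.<extension>.

    Hypothetical helper for illustration; a leading dot on the
    extension is tolerated and stripped.
    """
    file_name = f"{sample_name}.{extension.lstrip('.')}"
    return PurePosixPath("intermediates") / rule_name / file_name


if __name__ == "__main__":
    # Reproduces the example path used in the naming conventions.
    print(intermediate_path("liftover", "1000g", "vcf"))
```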
- Use the cd command to navigate to the root repository directory containing the Snakefile.
- To start the pipeline and produce the default list of files, simply call snakemake on the command line with the appropriate arguments (e.g. the --profile and --cluster-config flags).
- To generate a runtime report detailing the figures produced and performance-related numbers, use the --report snakemake flag (this requires that you have the Jinja2 Python package installed). The HTML file produced is completely self-contained and can be shared as needed. You can view it using any web browser, such as Firefox or Google Chrome.
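The invocation described in these steps can be sketched as a small Python helper that assembles the argument list. Only the flags named above (--profile, --cluster-config and --report) are used. The paths in the example (config/pbs_profile, config/cluster.json and results/report.html) are placeholders and are not taken from this repository, and `build_snakemake_command` is a hypothetical name.

```python
import shlex


def build_snakemake_command(profile, cluster_config=None, report=None):
    """Assemble a snakemake command line as a list of arguments.

    Hypothetical helper; it only builds the command and does not run it.
    """
    command = ["snakemake", "--profile", profile]
    if cluster_config is not None:
        command += ["--cluster-config", cluster_config]
    if report is not None:
        command += ["--report", report]
    return command


if __name__ == "__main__":
    # Placeholder paths: substitute the ones used in your own checkout.
    run = build_snakemake_command("config/pbs_profile", "config/cluster.json")
    print(shlex.join(run))
    # The report is generally requested once the workflow has completed.
    print(shlex.join(build_snakemake_command(
        "config/pbs_profile", report="results/report.html")))
```

The resulting list can be passed to subprocess.run from the repository root on a server where Snakemake and the batch scheduler are available.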
See our Projects tab and Issues tracker for a list of proposed features (and known issues).
We use the SemVer syntax to manage and maintain version numbers. For the versions available, see the releases on this repository here.
Many thanks to the following individuals who have been instrumental to the success of this project: