ATACseq_pipeline

This is an automated pipeline for processing ATAC-seq data. The user provides the raw ATAC-seq sequence data, and the pipeline runs the following steps: adapter trimming, alignment, filtering, plotting the chromosome pattern, getting the coverage depth, generating .bed files, comparing signal to noise, and summarizing all the previous steps.

The pipeline will return a summary file that concludes the run, named summary_file_BatchXX_RunXX.txt.

The pipeline was contributed by Yuhua Zhang, Alex Tsoi and Matthew Patrick.

Usage of the ATACseq_pipeline:

Running the pipeline

python ATACseq_pipeline.py -c <config_file>

Generating template files to modify

python ATACseq_pipeline.py --template

For help

python ATACseq_pipeline.py --help

Components of config_file:

Procedures to run, each starting with ##, e.g. ##alignment
Parameters for the corresponding procedure, each starting with --, e.g. --batch 14
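
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the pipeline's own parser: the function name parse_config is hypothetical, and storing parameters that appear before any ## line under the key None is an assumption made for illustration.

```python
def parse_config(text):
    """Group `--key value` lines under the most recent `##procedure` line."""
    sections = {}
    current = None  # assumption: parameters before any ## line go under None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith("##"):
            current = line[2:].strip()
            sections.setdefault(current, {})
        elif line.startswith("--"):
            # Split "--batch 14" into the key "batch" and the value "14".
            key, _, value = line[2:].partition(" ")
            sections.setdefault(current, {})[key] = value.strip()
    return sections


example = """##alignment
--batch 14
--run 1789
"""
print(parse_config(example))
```

All values are kept as strings here; converting --batch or --job_align to integers is left to the caller.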

A quick start

In this case, the program runs the whole procedure and generates the intermediate files at each step as well as the final results. The user needs to specify at least the required parameters.

The user will need to provide several files, including:

1. config file for the whole pipeline;
2. bedprofile for bedprofilecounts;
3. config file for gotcloud alignment;
4. specific output for intermediate files, if required.

Note that the template for these files can be generated by python ATACseq_pipeline.py --template. The user can modify the parameters as they like.

Required parameter

--seq_data: Pathway to the sequence data, end with Run_XXX, no '/' included at the end; the program will search for '/elder' folder and find the sequence data, e.g. --seq_data /pathway/to/data/Run_XXX

--core_info_file: The core information files, e.g. /pathway/to/file/Batch14_Run_1789_elder.txt;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

--conf: The config file used by gotcloud. Please don't specify OUT_DIR and FASTQ_LIST in the config file, e.g. --conf /pathway/to/config_file

--bedprofile: The parameters specified for bedprofilecounts, e.g. --bedprofile /pathway/../bedprofile
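
Before launching a full run it can help to confirm that all six required parameters are present. The sketch below assumes the parameters have already been read into a plain dictionary; the helper name missing_required is hypothetical and is not part of the pipeline.

```python
# The six parameters listed as required in this README.
REQUIRED = ("seq_data", "core_info_file", "batch", "run", "conf", "bedprofile")


def missing_required(params):
    """Return the required parameter names that are absent or empty."""
    return [name for name in REQUIRED if not params.get(name)]


params = {"batch": "14", "run": "1789", "seq_data": "/pathway/to/data/Run_XXX"}
print(missing_required(params))
```

An empty list means every required parameter has a value; whether the values themselves are valid paths is not checked.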

Optional parameter

--specific_output: The output pathway file that contains the specific output for each step, e.g. --specific_output /pathway/to/file. The output pathway file should contain the specific pathways for storing the intermediate files generated by the pipeline, specified by --out_sampleinfo --out_bam --out_proc_bam --out_plot --out_coverage --out_bed --out_clus --out_s2n --out_summary. Any output pathway that is not specified defaults to the current pathway. The user can use python ATACseq_pipeline.py --template to generate a template for this file.

--entire_output: Where you would like to store all the output (intermediate files and final results). The default is the current directory. Several directories will be created under the entire_output directory, including /BAM, /BAMprocessed, /BED, /QCs.

--job_AT: Number of processes for the adapter trimming step, the default is 5;

--job_align: Number of processes for alignment, the default is 3. Please pay attention to memory, as the memory demand of this step is considerably high;

--job_filter: Number of processes for filtering, the default is 5;

--intermediate_file: Whether to keep the intermediate .bam files generated in the filtering step, the default is no. e.g. --intermediate_file Yes

--trim_read: Whether to trim the reads to a certain length, the default is No. e.g. --trim_read Yes

--reads: If trim_read is Yes, set the number of reads you want to keep. The default is 38.

--direct: Set the direction from which to trim the reads. The options are 'l' and 'f'. The default is 'f'.
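
For the directory layout that --entire_output describes, the following sketch creates the four sub-directories named above. It is an illustration only: the function name make_output_tree is hypothetical, and the pipeline may create further directories that this README does not list.

```python
import os
import tempfile

# Sub-directories named in the --entire_output description.
SUBDIRS = ("BAM", "BAMprocessed", "BED", "QCs")


def make_output_tree(entire_output="."):
    """Create the sub-directories under entire_output and return their paths."""
    paths = [os.path.join(entire_output, name) for name in SUBDIRS]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if it already exists
    return paths


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    created = make_output_tree(root)
    print([os.path.basename(p) for p in created])
```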

A brief example of config_file

--entire_output .

--specific_output ./output_file

--batch 14

--run 1789

--job_AT 5

--seq_data /pathway/to/seq/data

--core_info_file /core/info/file.txt

--job_align 5

--conf /pathway/to/config_file

--job_filter 5

--intermediate_file Yes

--bedprofile /pathway/../bedprofile
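
A config_file like the one above can also be written from a script, which may be convenient when the same settings are reused across batches. This is a minimal sketch under the format described in this README; format_config is a hypothetical helper, not something the pipeline provides.

```python
def format_config(params, procedure=None):
    """Render parameters as `--key value` lines, optionally under a ## header."""
    lines = []
    if procedure:
        lines.append("##" + procedure)
    # Dictionaries keep insertion order, so the lines follow the order given.
    lines.extend("--{} {}".format(key, value) for key, value in params.items())
    return "\n".join(lines) + "\n"


params = {"entire_output": ".", "batch": 14, "run": 1789, "job_AT": 5}
print(format_config(params), end="")
```

The returned text can be saved to a file and passed to python ATACseq_pipeline.py -c <config_file>.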

Separate each step

If the user prefers to run each procedure step by step and specify the output pathway for the intermediate files, make the config_file contain only one operation each time.

Adapter_trimming

This step will generate the trimmed files under the same directory as the input sequence data, as well as the information extracted from the input core info file (sampleID, cell line description and patient ID).

--core_info_file: The core information files, e.g. /pathway/to/file/Batch14_Run_1789_elder.txt;

--seq_data: Pathway to the sequence data, end with Run_XXX, no '/' included at the end; the program will search for '/elder' folder and find the sequence data, e.g. --seq_data /pathway/to/data/Run_XXX

--job_AT: Optional parameter. Number of processes for the adapter trimming step, the default is 5;

--out_sampleinfo: Optional parameter. Directory to store the generated sample info file, e.g. --out_sampleinfo /pathway/..

Example

##adapter_trimming

--core_info_file /core/info.txt

--job_AT 5

--seq_data /pathway/to/seq/data

--out_sampleinfo /pathway/to/store

Alignment

This step will generate the aligned .bam files as well as a metagotCloudbamfiles file, which will be used for filtering. The .bam files will be stored in the pathway specified by the user or, by default, in the current directory. The metagotCloudbamfiles file will be stored in the same pathway as the config file for gotcloud.

--trimmed_file: Pathway to the trimmed files, end with Run_XXX, no '/' included at the end; the program will search for '/elder' folder and find the sequence data;

--out_conf: Directory to store config and index files that would be used by gotcloud;

--out_bam: Directory to store the output bam files. A new directory named as BatchXX_RunXXXX will be created to store the generated .bam files;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

--conf: The config_file used by gotcloud, e.g. --conf /pathway/to/config_file;

--job_align: Optional parameter. Number of processes for alignment, the default is 3. Please pay attention to memory, as the memory demand of this step is considerably high;

Example

##alignment

--job_align 5

--trimmed_file /pathway/to/adapter/trimmed/file

--out_conf /directory/to/store/config/index/files/for/gotcloud

--out_bam /output/directory/to/store/generated/bam/file

--batch 14

--run 1789

--conf /pathway/to/config_file

Filtering

This step will generate the filtered .bam file as well as the metagotCloudbamfiles which will be used in the plotting step. Whether to keep the intermediate files is up to the users.

--in_bam: MetagotCloud files. e.g. --in_bam /pathway/metaXXXX;

--out_proc_bam: Output directory for the processed .bam files. A new directory named BatchXX_RunXXXX will be created to store the processed .bam files;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

--job_filter: Optional parameter. Number of processes for filtering, the default is 5;

--intermediate_file: Optional parameter. Whether to keep the intermediate .bam files generated in the filtering step, the default is no. e.g. --intermediate_file Yes

Example

##filtering

--job_filter 5

--out_proc_bam /pathway

--in_bam /pathway/metagotCloudbamfiles_Batch14_Run1789

--intermediate_file Yes

--batch 14

--run 1789
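
Several steps name their outputs after the batch and run numbers, as in metagotCloudbamfiles_Batch14_Run1789 above. The sketch below builds such names from --batch and --run. Both helper names are hypothetical, and the exact naming used by the pipeline in other steps is assumed to follow the same BatchXX_RunXXXX pattern.

```python
import os


def batch_run_tag(batch, run):
    """Return the BatchXX_RunXXXX tag, e.g. Batch14_Run1789."""
    return "Batch{}_Run{}".format(batch, run)


def meta_file_path(directory, batch, run):
    """Path of the meta file to pass to --in_bam, under the assumed naming."""
    return os.path.join(directory, "metagotCloudbamfiles_" + batch_run_tag(batch, run))


print(batch_run_tag(14, 1789))
print(meta_file_path("/pathway", 14, 1789))
```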

Plotting

This step will generate plots of the insert size frequency, comparing the filtered reads to the unfiltered reads.

--in_bam: Directory to MetagotCloud files for unfiltered reads, the pipeline will search for the metagotcloud files based on the run and batch number provided by user. e.g. --in_bam /pathway/..;

--in_bam_filtered: Directory to MetagotCloud files for filtered reads, the pipeline will search for the metagotcloud files based on the run and batch number provided by user. e.g. --in_bam_filtered /pathway/metaXXXX;

--out_plot: The directory to store the generated plots. A new directory named BatchXX_RunXXXX will be created to store the generated plots;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

Example

##plotting

--in_bam /pathway/meta_file

--in_bam_filtered /pathway/meta_filtered_file

--out_plot /pathway/plot

--batch 14

--run 1789
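
The quantity behind these plots is a frequency table of insert sizes. The sketch below shows that calculation on toy values only. It does not read .bam files, the function name insert_size_frequencies is hypothetical, and how the pipeline itself bins or normalizes insert sizes is not stated in this README.

```python
from collections import Counter


def insert_size_frequencies(sizes):
    """Map each insert size to the fraction of reads that have it."""
    counts = Counter(sizes)
    total = sum(counts.values())
    return {size: counts[size] / total for size in sorted(counts)}


# Toy insert sizes, for illustration only (not real data).
unfiltered = [50, 50, 200, 400]
filtered = [50, 200]
print(insert_size_frequencies(unfiltered))
print(insert_size_frequencies(filtered))
```

Plotting the two tables on the same axes gives the filtered versus unfiltered comparison described above.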

getting_coverage

This step will generate the summary table of the coverage depth across the whole genome. Coverage is checked at depths of 1X, 2X and 5X.

--dir_filtered_bam: The directory to the filtered .bam files (Please include BatchXX_RunXX) e.g. --dir_filtered_bam /pathway/../BatchXX_RunXXXX;

--out_coverage: The output directory where you would like to store the output, e.g. --out_coverage /pathway/to/store/output;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

Example

##getting_coverage

--dir_filtered_bam /path/way

--out_coverage /pathway/for/output

--batch 14

--run 1789
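
To make the 1X, 2X and 5X figures concrete, the sketch below computes the fraction of positions covered at each depth from a list of per-position depths. It assumes that coverage at NX means the share of positions with depth of at least N; the README does not state the pipeline's exact definition, and coverage_fractions is a hypothetical name.

```python
def coverage_fractions(depths, thresholds=(1, 2, 5)):
    """Fraction of positions whose depth is at least each threshold."""
    total = len(depths)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(d >= t for d in depths) / total for t in thresholds}


# Toy per-position depths, for illustration only (not real data).
depths = [0, 1, 1, 2, 3, 5, 8, 0, 2, 6]
print(coverage_fractions(depths))
```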

making_bed

This step will generate the bed file for both filtered and unfiltered .bam files.

--dir_filtered_bam: The directory to the filtered .bam files, e.g. --dir_filtered_bam /path/way;

--dir_bam: The directory to the unfiltered .bam files, e.g. --dir_bam /path/way;

--out_bed: The output directory where you would like to store the output, the pipeline will generate a directory named BatchXX_RunXXXX under the directory provided by the user. e.g. --out_bed /pathway/to/store/output;

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

Example

##making_bed

--dir_filtered_bam /path/way

--dir_bam /path/way

--out_bed /pathway/to/store/output

--batch 14

--run 1789

signal_to_noise

This step will generate the plots of signal to noise.

--bedprofile: The bedprofile that will be used by bedprofilecount, the template can be generated by the pipeline. e.g. --bedprofile /pathway/../bedprofile;

--dir_bed: The directory to the .bed files, e.g. --dir_bed /path/way;

--out_clus: The output directory where you would like to store the .clus file, e.g. --out_clus /pathway/to/store/output;

--sample_info: The previous generated sampleInfo file, e.g. --sample_info /pathway/../SampleInfo_BatchXX_RunXXXX;

--out_s2n: The output directory to store the plot of signal to noise, e.g. --out_s2n /pathway/to/store/plot

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

Example

##signal_to_noise

--bedprofile /pathway/../bedprofile

--dir_bed /path/way

--out_clus /pathway/to/store/output

--sample_info /pathway/../sample_info

--out_s2n /pathway/to/store/plot

--batch 14

--run 1789

summary

This step will generate the summary of the quality control steps.

--dir_sampleinfo: The directory to the sampleinfo file generated before, the pipeline will search for SampleInfo_BatchXX_RunXXXX under the directory provided. e.g. --dir_sampleinfo /pathway/..;

--dir_proc_bam: The directory to the processed .bam files, the pipeline will search for readcount_BatchXX_RunXXXX under the directory provided. e.g. --dir_proc_bam /pathway/..;

--dir_coverage: The directory to the previous summary table of coverage depth, the pipeline will search for *BatchXX_RunXXXX.table under the directory provided. e.g. --dir_coverage /pathway/..;

--dir_s2n: The directory to the previous summary table of signal to noise, the pipeline will search for *BatchXX_RunXXXX.signal2noise under the directory provided by user. e.g. --dir_s2n /pathway/..;

--out_summary: The output directory to store the summary table, e.g. --out_summary /pathway/..

--batch: The batch number, e.g. --batch 14;

--run: The run number, e.g. --run 1789;

Example

##summary

--dir_sampleinfo /pathway/..

--dir_proc_bam /pathway/..

--dir_coverage /pathway/..

--dir_s2n /pathway/..

--out_summary /pathway/..

--batch 14

--run 1789
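
The summary step locates earlier outputs by pattern, such as *BatchXX_RunXXXX.table. Before running it, the same kind of search can be done by hand to confirm the expected files exist. This is a sketch with the standard glob module, not the pipeline's own lookup; find_step_outputs and the file names created in the demonstration are hypothetical.

```python
import glob
import os
import tempfile


def find_step_outputs(directory, batch, run, suffix):
    """Return files in directory matching *Batch<batch>_Run<run><suffix>."""
    tag = "Batch{}_Run{}".format(batch, run)
    return sorted(glob.glob(os.path.join(directory, "*" + tag + suffix)))


with tempfile.TemporaryDirectory() as root:
    # Hypothetical file names, created only to exercise the pattern.
    for name in ("coverage_Batch14_Run1789.table", "coverage_Batch15_Run1800.table"):
        open(os.path.join(root, name), "w").close()
    found = find_step_outputs(root, 14, 1789, ".table")
    print([os.path.basename(p) for p in found])
```

Passing ".signal2noise" as the suffix performs the matching search for --dir_s2n.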
