Skip to content

Peak calling for ATAC-seq data using contrastive learning on biological replicates

Notifications You must be signed in to change notification settings

Tuteja-Lab/UnsupervisedPeakCaller

Repository files navigation

The Replicative Contrastive Learner (RCL), an Unsupervised Contrastive Peak Caller

This page described the Unsupervised contrastive peak caller known as RCL (Replicative Contrastive Learner). The accompanying publication is available: 10.1101/gr.277677.123.

Table of Contents

  1. Prerequisites
  2. Quickstart
  3. Installation
  4. Tutorial
  5. Input
    1. Tutorial Step 1
  6. Preprocessing
    1. Preprocessing Input
    2. Preprocessing Command-Line Options
    3. Tutorial Step 2
    4. Preprocessing Output
    5. Preprocessing Cleanup
  7. Peak Calling
    1. Peak Calling Input
    2. Peak Calling Command-Line Options
    3. Tutorial Step 3
    4. Peak Calling Output
  8. How to Cite
  9. Contact

Prerequisites

For input preprocessing steps, the following tools and R libraries are required:

For the deep learner step, a GPU is needed. Other packages needed are:

Quickstart

Here is a demo of the steps needed to get started with RCL on a Fedora 39 install:

## install dependencies
# task: install non-python dependencies in the root environment
# task: make sure gpu is recognized
conda create -n rcl		# creating rcl conda environment
conda install -n rcl pytorch	# Pytorch
conda install -n rcl lightning	# Pytorch Lightening
conda install -n rcl scipy scikit-learn pandas
conda activate rcl

## the remaining commands follow tutorial described in this README
git clone https://github.com/Tuteja-Lab/UnsupervisedPeakCaller.git
cd UnsupervisedPeakCaller 
# task: download data as RCLexamples.zip into current directory
unzip -j RCLexamples.zip -d example
cp -r example example.save
# next two commands force fresh run by overwriting existing output files
bash ./preprocessing.bash -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -t 20 -n test
bash ./run_rcl.sh -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -w
diff --brief example example.save
# stored copy of RCL input files appear in example
# random initialization slightly alters RCL fit (example/rcl.ckpt) and scores (example/rcl.bed)

Installation

After installing the prerequisites, all you have to do is clone RCL and move into the root directory of the cloned RCL repository:

git clone https://github.com/Tuteja-Lab/UnsupervisedPeakCaller.git
cd UnsupervisedPeakCaller

Tutorial

The RCL pipeline starts after you obtain reference-aligned read data. The pipeline consists of a data preprocessor and a peak caller, which are covered in detail in this document. To demonstrate the pipeline steps on a small dataset, we have prepared a small tutorial. The tutorial consists of three parts discussed in corresponding parts of this document. Quick links to all three parts are listed here:

  1. Tutorial Step 1: Get the data.
  2. Tutorial Step 2: Preprocess the data.
  3. Tutorial Step 3: Peak calling.

Input

The RCL preprocessor requires BAM files for each replicate. The RCL peak caller requires the output of the preprocessing step. You can read more about peak caller input.

Example: Tutorial Step 1

To demonstrate RCL, we provide the portion of the MCF-7 dataset aligning to human chromosome 10. The output for this example are provided with RCL, so if you want to skip data preprocessing for now, you can go directly to peak calling. Or to compare the output of your run to the tutorial output we obtained, save a copy of the example directory.

Continuing with the preprocessing demonstration, download the necessary BAM files and indices from https://iastate.box.com/s/9uavg2zsy5w0i7v227ei7yaeea7yktr6. If you download the zip file RCLexamples.zip from the cybox and place it in the root of the RCL git repository, the following commands (executed from the root of the RCL git repository) will place them appropriately:

unzip -j RCLexamples.zip -d example

You can skip to Tutorial Step 2.

Preprocessing

We have provided a bash preprocessing script to convert input BAM files (see input) into the required RCL input. The script assumes your data have been aligned to the Ensembl assembly of the mouse or human genome. If not, the script will still run (though it is important you use command line option -g), but no blacklist regions will be removed.

Preprocessing Input

The preprocessing input is the same as the pipeline input.

Preprocessing Command-Line Options

For more information about the preprocessing script type bash ./preprocessing.bash -? from the RCL git repository root. The most important command-line options are mentioned below:

  • -b (DEFAULT: none, you must provide): The BAM files for the individual replicates. You must name at least two BAM files from replicate experiments. Space-separate the names and surround the list with double quotes. For example, the tutorial is run with this options set as -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam".
  • -c (DEFAULT: median): An integer coverage cutoff to identify candidate peaks. The default is to use the minimum median (zero coverage sites excluded) observed across replicates on a per-chromosome basis. However, you can call more peaks by reducing this number. In the RCL publication, we demonstrate that decreasing this cutoff generally enhances RCL performance, but it cannot be less than 1.
  • -d (DEFAULT: example): The data directory, where both input and output will be written.
  • -g (DEFAULT: hg): Indicate the genome reads in the input BAM files are aligned to. The default assumes the reads are aligned to the Ensembl assembly hg38. If you are using the Ensembl assembly of mouse, you should set option -g mm. If you are using another genome, you should name it as you like -g my_id, but not mm or hg. It is very important that you do not leave the default value if your data are not aligned to hg38!
  • -t (DEFAULT: 1): Set the number of threads you would like to use. Most preprocessing steps have been parallelized, so take advantage of it with this command option.
  • -n (DEFAULT: out): Data preprocessing is an expensive operations with many intermediate files. To keep track or maintain multiple versions of those files, name them with this command option. All intermediate and final files will be prefixed with this identifier and placed in the output directory (option -o) if you use the save option (-s). You can reuse these saved files and copy them to the expected input files for RCL by running the preprocessing command again. When the files have already been generated, the preprocessing script will run very quickly. Beware that our logic for checking integrity of intermediate files is imperfect. If your call to preprocessing.bash is killed, intermediate files may be in a corrupt state, which may or may not be detectable. To be sure, either delete all intermediate files (rm example/test* example/chr*/test*, where test is the name -n test you chose for the run) and rerun the preprocessing script or run the preprocessing with the overwrite option (-w).
  • -o (DEFAULT: same as input directory): The directory where intermediate and output files will be stored.
  • -r (DEAFULT: chr): If your genome reference names include "chr" as prefix, you should set this reference prefix to the emptry string "" (two double quotes without space between them).
  • -s (DEFAULT: no): Save the intermediate files generated by the preprocessing script. Also see options -n and -w.
  • -w (DEFAULT: no): Overwrite any files from a previous run of the preprocessing script with the same name (option -n) and the save option -s.

Example: Tutorial Step 2

After following the instructions in Tutorial Step 1 to get and place the data, you can run the preprocessing tool on the sample data as demonstrated below. It is assumed you are at the root of the RCL git repository when you type this command. Also, you should choose an appropriate number of threads for your system via option -t.

bash ./preprocessing.bash -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -t 12 -n test

You can skip to Tutorial Step 3.

Preprocessing Output

The final output of the preprocessing consists of two types of Bed files and a chromosome list.

  1. Bed-formatted (5-column) for each of the replicates and each of the reference sequences containing the coverage for each fixed-length fragment chosen by preprocessing for input to RCL. Specifically, the coverage for replicate REP on reference sequence SEQ is stored in DIR/SEQ/REP.covBga.txt, where REP is the basename (without .bam extension) of an input BAM file, SEQ is a reference sequence found in these BAM files, and DIR is the data directory passed to preprocessing.bash via command option -d.
  2. Bed-formatted (4-column) containing the candidate peak regions that RCL will score. This file is written to bigInputs.txt in your data directory passed to preprocessing.bash via command option -d.
  3. File chrList.txt with the names of the reference sequences (chromosomes) where you would like to call peaks. The names should match the first column of bigInputs.txt. The file is written to the data directory passed to preprocessing.bash via command option -d.

Preprocessing Cleanup

Data preprocessing produces many large intermediate files. By default, these intermediate files are deleted, leaving only the information about the candidate regions needed by the RC peak caller. However, if you use the save option (-s option) all these intermediate files will be saved, prefixed by the chosen name (-n option). To delete these files, carefully use the following command: rm example/test* example/chr*/test*, to clean up a previous preprocessing run with options -d example -n test.

Peak Calling

We provide a bash script run_rcl.sh that fits RCL and assigns predictive peaks scores to each of the candidate regions prepared by the preprocessing step or your own preprocessing pipeline.

Peak Calling Input

The easiest way to get input for RCL peak calling is to use the preprocessing step, but you can prepare input in some other way. Identification of good candidate regions is an important part to the success of RCL. Here are the files you need to provide. All the following files should be in the current directory or a data directory provided by command-line option -d.

  1. A chromosome file (default: chrList.txt): This file contains the names of the reference sequences (chromosomes) where you would like to call peaks. The names in this file should match the first column of all bed files mentioned below. You can change the name of this file with the command option -c, and the file should be found in the data directory.
  2. Bedgraph coverage files: You must provide a bed-formatted (5-column) bedgraph file for each of the replicates and each of the reference sequences (chromosomes) where you want to call peaks. Each replicate file contains the read coverage for the same fixed-length segments chosen for input to RCL. Specifically, the coverage for replicate REP on reference sequence SEQ is expected in file DIR/chrSEQ/REP.covBga.txt. REP is the basename (without .bam extension) of an input BAM file, provided via option -b to run_rcl.sh. SEQ is a reference sequence (chromosome) name, provided in the chromosome file. DIR is the data directory. The extension covBga.txt can be changed with command line option -x.
  3. Candidate region file: You must provide a bed-formatted (4-column) file containing the candidate peak regions that RCL will score. The fixed-length regions are extracted from these candidate regions by our preprocessing script, Stage 5. This file is expected to be called bigInputs.txt in the data directory, but you can override the filename using option -i.

Example: Tutorial Step 3

Now that you have preprocessed the data in Tutorial Step 2, here we demonstrate how to train the model and score the candidate regions for the MCF7, chromosome 10 data. We assume you run this script from the root of the RCL git repository.

bash run_rcl.sh -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -w

The -w flag is necessary to overwrite the results included with the tutorial example. If you want to see the results without running RCL, please look at the file example/rcl.bed and see (RCL output)[#rcl_output].

Peak Calling Command-Line Options

For more information about how to run the script, type run_rcl.sh -?. Parameters that might be important to adjust for obtaining better fits are the number of epochs (command line option -e) and the batch size (command line option -h). Here are other important options:

  • -b (DEFAULT: none, you must provide): The BAM files for the individual replicates. In fact, the BAM files are not used, just the basenames (filenames without .bam extension), so you can just provide the basenames via this argument instead.
  • -c (DEFAULT: chrList.txt): The name of the chromosome file with the reference sequence names you want to call peaks on.
  • -d (DEFAULT: .): The data directory where the input files are located. Output files will also be written here.
  • -e (DEFAULT: 25): Number of epochs for fitting the learner.
  • -h (DEFAULT: 256): Batch size.
  • -i (DEFAULT: bigInputs.txt): Name of the candidate region files.
  • -r (DEFAULT: chr): By default, the bedgraph coverage files are expected in DATA/chrNAME.covBga.txt, where DATA is the data directory and NAME is the name of a reference sequence listed in the chromosome file. You can remove the prefix with command option -r "".
  • -s (DEFAULT: not set): Save the output files.
  • -w (DEFAULT: not set): Overwrite previous results.
  • -x (DEFAULT: .covBga.txt): The extension of the bedgraph coverage files.

Peak Calling Output

The trained model is called rcl.ckpt and results are stored in rcl.bed, both in the directory passed to the -d command-line option. The output will have

chromosome name, peak start position, peak end position, peak name, peak score, training region start position, training region end position, for example

10      49829   50258   10segment1      0.18526842      49543   50543
10      73663   74515   10segment2      0.8270205       73589   74589

How to Cite

Vu, H. T., Zhang, Y., Tuteja, G., & Dorman, K. S. (2023). Unsupervised contrastive peak caller for ATAC-seq. Genome Research, gr-277677.
Link to article: https://genome.cshlp.org/content/33/7/1133.full

Contact

Ha Vu (hhvu@iastate.edu or vthihong@umich.com), Yudi Zhang (yudiz@iastate.edu), or Karin Dorman (kdorman@iastate.edu)

About

Peak calling for ATAC-seq data using contrastive learning on biological replicates

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •