Author: | Shiqi Tu |
---|---|
Contact: | tushiqi@picb.ac.cn |
Version: | 1.0.0 |
Date: | 2018-08-24 |
MAnorm2_utils
is designed to coordinate with MAnorm2, an R package for
differential analysis with ChIP-seq signals between two or more groups of
replicate samples. MAnorm2_utils
is primarily used for processing a set
of ChIP-seq samples into a regular table recording the read abundances and
enrichment states of a list of genomic bins in each of these samples.
The primary utility of MAnorm2_utils
comes from the two scripts bound
with it, named profile_bins
and sam2bed
, respectively.
Given the peak regions and mapping positions of reads of each of a set of
ChIP-seq samples, profile_bins
comes up with a list of reference
genomic bins (each being enriched for ChIP-seq signals in at least one of the
samples), and deduces the read count as well as enrichment status of each of
the bins in each sample. Refer to MACS for more information about the
technical terms mentioned above.
We recommend MACS 1.4 for identifying peaks for ChIP-seq samples associated
with narrow genomic regions of reads enrichment (e.g., samples for most
transcription factors and histone modifications like H3K4me3 and H3K27ac). In
fact, although having a general applicability, profile_bins
is
specifically suited to processing the output files generated by MACS 1.4. For
histone modifications constituting broad enriched domains (e.g., H3K9me3 and
H3K27me3), we recommend SICER as the peak caller.
The following is a sample usage of profile_bins
of the simplest form:
profile_bins --peaks=peak1.bed,peak2.bed \
--reads=read1.bed,read2.bed \
--labs=s1,s2 -n example
Note
profile_bins
only recognizes BED-formatted input files. For read
alignment results stored in SAM files, use first sam2bed
to
transform them into BED files before calling profile_bins
(BED files
created by sam2bed
have been specifically designed to suit
profile_bins
; see also the section below). For BAM-formatted
files, refer to Samtools for converting them into SAM files.
If everything goes smoothly, the command above will generate two files, named
example_profile_bins_log.txt
and example_profile_bins.xls
,
respectively. The former records the full list of parameter settings for
calling profile_bins
, as well as some summary statistics regarding each
of the supplied ChIP-seq samples. The latter gives the read count and
enrichment status for each deduced reference genomic bin in each sample, and
has a format like the following (data shown here is only for illustration):
profile_bins
chrom | start | end | s1.read_cnt | s2.read_cnt | s1.occupancy | s2.occupancy |
---|---|---|---|---|---|---|
chr1 | 28112 | 29788 | 115 | 4 | 1 | 0 |
chr1 | 164156 | 166417 | 233 | 194 | 1 | 1 |
chr1 | 166417 | 168417 | 465 | 577 | 1 | 1 |
chr1 | 168417 | 169906 | 15 | 34 | 0 | 1 |
To clarify, a genomic bin is "occupied" by a ChIP-seq sample if and only if its middle point is covered by some peak region of the sample.
profile_bins
supports a number of parameters for a customized
configuration for deducing reference genomic bins as well as counting the reads
falling in them. Type profile_bins --help
in the command line for a
complete list of these parameters and a brief description of each of them.
Among others, several parameters deserve specific attention:
By default,
profile_bins
merges peaks from all the provided ChIP-seq samples into a consensus set of peak regions, and divides up each broad merged peak into consecutive genomic bins. Specify--typical-bin-size
to control the size of such genomic bins. Note that the merged peaks having a size comparable to this parameter are left untouched.The default value of
--typical-bin-size
, which is 2000, suits well the ChIP-seq samples of histone modifications. For ChIP-seq samples of transcription factors, setting the parameter to 1000 is recommended.In cases where summit positions of the supplied peaks are available (e.g., when the peaks are called by using MACS 1.4), you may provide
profile_bins
with this information via specifying--summits
. Summit positions will be used to determine an appropriate start point for dividing up a broad merged peak.Alternatively, you can directly specify a set of genomic regions as the reference bins to profile, by setting
--bins
to a BED file. In this case,profile_bins
focuses on these provided bins and suppresses the peak merging procedure.--typical-bin-size
and--summits
are ignored when--bins
is specified.Before being assigned to reference bins, each read (or read pair) is converted into a genomic locus representing the middle point of the underlying DNA fragment. By default,
profile_bins
treats the supplied reads as single-end, and shifts downstream the 5' end of each of them by--shiftsize
to reach the putative middle point.--shiftsize
defaults to 100, and may be set to half of the practical DNA fragment size selected in the library preparation process.Set
--paired
to indicate the reads are paired-end. In this case, middle point of the underlying DNA fragment associated with each read pair could be accurately inferred. Note that two reads from the same ChIP-seq sample are considered as a read pair only if they have exactly the same name (i.e., the 4th column in a BED file).--shiftsize
is ignored when--paired
is set.--keep-dup
controls the program's behavior regarding duplicate reads (or read pairs) potentially resulting from PCR amplification. For single-end reads, two reads are considered as duplicates if their 5' ends are mapped to the same genomic locus; for paired-end reads, two read pairs are considered as duplicates if their implied DNA fragments occupy the same genomic interval.By default,
profile_bins
preserves all the reads (or read pairs) for the counting procedure. For both paired-end reads and deep-sequencing single-end reads, we strongly recommend setting--keep-dup
to 1 to enhance the specificity of downstream analyses. In that case, for each ChIP-seq sample only one read (or read pair) of a set of duplicates is retained for counting. Note also that the output log file records, for each sample, the ratio of reads (or read pairs) that are removed due to--keep-dup
.profile_bins
supports the idea of using a configuration file to deliver parameters, to avoid repeated typing in the command line. To do that, write a configuration file following the format as demonstrated below, and pass it to--parameters
:peaks=peak1.bed,peak2.bed reads=read1.bed,read2.bed labs=s1,s2 n=example summits=summit1.bed,summit2.bed paired keep-dup=1
Note that
--parameters
could be used in mixture with the other command-line arguments.
Refer to the Manual of MAnorm2_utils for a full specification of the
parameters supported by profile_bins
.
sam2bed
is designed to coordinate with profile_bins
, since the
latter only accepts BED-formatted files. The simplest form of calling
sam2bed
is as follows:
sam2bed -i File.sam -o File.bed
The program will read from the standard input stream if -i
is not
specified.
In the vast majority of cases, the default setting of most of the parameters
supported by sam2bed
should be used.
The only parameter that may be customized in
practice is --min-qual
, which controls the program's behavior
regarding filtering out the SAM alignment records with a low mapping quality.
Type sam2bed --help
in the command line for a brief description of each
parameter supported by sam2bed
.