Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

MACS2 segmentation fault due to chromosome names #170

Open
jnktsj opened this issue Jan 10, 2017 · 2 comments
Open

MACS2 segmentation fault due to chromosome names #170

jnktsj opened this issue Jan 10, 2017 · 2 comments

Comments

@jnktsj
Copy link

jnktsj commented Jan 10, 2017

Hi, I just wanted to let you know that MACS2 failed to finish when reference chromosome names contain symbols (e.g. 2L52.1a:gene=WBGene00007063, ZK856.2:type=Transposon-mRNA:gene=WBGene00023448, etc). It stops right after #3 Pre-compute pvalue-qvalue table. I used BED for the run.

@taoliu
Copy link
Contributor

taoliu commented Mar 22, 2017

Hi @jnktsj, I just tested a BED file with about 200k lines and saw no problem. I gave a weird chromosome name prefix '123abc: +-_[]{})(/:.,?!^='. Could you share me your BED file?

(TLPython)tliu4@srv-u10-28:1$macs2 --version
macs2 2.1.1.20160309
(TLPython)tliu4@srv-u10-28:1$head test.bed
123abc: +-_[]{})(\/:.,?!^=1	564917	565017	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	565533	565633	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	565844	565944	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	565858	565958	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	566132	566232	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	566212	566312	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	566736	566836	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	566976	567076	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	567211	567311	.	.	+
123abc: +-_[]{})(\/:.,?!^=1	567320	567420	.	.	+
(TLPython)tliu4@srv-u10-28:1$macs2 callpeak -t test.bed
INFO  @ Wed, 22 Mar 2017 12:07:28:
# Command line: callpeak -t test.bed
# ARGUMENTS LIST:
# name = NA
# format = AUTO
# ChIP-seq file = ['test.bed']
# control file = None
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off
# Paired-End mode is off

INFO  @ Wed, 22 Mar 2017 12:07:28: #1 read tag files...
INFO  @ Wed, 22 Mar 2017 12:07:28: #1 read treatment tags...
INFO  @ Wed, 22 Mar 2017 12:07:28: Detected format is: BED
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 tag size is determined as 100 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 tag size = 100
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  total tags in treatment: 199977
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 user defined the maximum tags...
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  tags after filtering in treatment: 199583
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  Redundant rate of treatment: 0.00
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 finished!
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 Build Peak Model...
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 looking for paired plus/minus strand peaks...
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 number of paired peaks: 4488
INFO  @ Wed, 22 Mar 2017 12:07:29: start model_add_line...
INFO  @ Wed, 22 Mar 2017 12:07:29: start X-correlation...
INFO  @ Wed, 22 Mar 2017 12:07:29: end of X-cor
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 finished!
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 predicted fragment length is 254 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 alternative fragment length(s) may be 254 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #2.2 Generate R script for model : NA_model.r
INFO  @ Wed, 22 Mar 2017 12:07:29: #3 Call peaks...
INFO  @ Wed, 22 Mar 2017 12:07:29: #3 Pre-compute pvalue-qvalue table...
INFO  @ Wed, 22 Mar 2017 12:07:30: #3 Call peaks for each chromosome...
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write output xls file... NA_peaks.xls
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write peak in narrowPeak format file... NA_peaks.narrowPeak
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write summits bed file... NA_summits.bed
INFO  @ Wed, 22 Mar 2017 12:07:30: Done!
(TLPython)tliu4@srv-u10-28:1$head NA_peaks.narrowPeak
123abc: +-_[]{})(\/:.,?!^=1	919428	919740	NA_peak_1	40	.	5.20653	7.81607	4.03555	121
123abc: +-_[]{})(\/:.,?!^=1	1185654	1185908	NA_peak_2	37	.	5.09424	7.42382	3.73522	125
123abc: +-_[]{})(\/:.,?!^=1	1307407	1307661	NA_peak_3	40	.	5.20653	7.81607	4.03555	193
123abc: +-_[]{})(\/:.,?!^=1	2126579	2126833	NA_peak_4	28	.	4.33877	6.21931	2.81285	68
123abc: +-_[]{})(\/:.,?!^=1	2246572	2246949	NA_peak_5	37	.	5.09424	7.42382	3.73522	139
123abc: +-_[]{})(\/:.,?!^=1	2313278	2313532	NA_peak_6	37	.	5.09424	7.42382	3.73522	74
123abc: +-_[]{})(\/:.,?!^=1	2345913	2346167	NA_peak_7	58	.	6.64894	10.22048	5.85201	114
123abc: +-_[]{})(\/:.,?!^=1	2479551	2479806	NA_peak_8	61	.	7.03455	10.65279	6.16741	106
123abc: +-_[]{})(\/:.,?!^=1	3640928	3641354	NA_peak_9	98	.	9.41347	15.39576	9.89543	205
123abc: +-_[]{})(\/:.,?!^=1	3773721	3774031	NA_peak_10	61	.	6.79233	10.67463	6.16741	154

@taoliu
Copy link
Contributor

taoliu commented Mar 22, 2017

My suspect is that the problem is not caused by chromosome name, but the number of "chromosomes". The current hash table used in MACS2 is not optimized for large number of "chromosomes". To save computational speed, each chromosome will have minimum 100K data points initially, so if you have 50k 'chromosomes', there will be a minimum number of 5G datapoints. If you multiply the number with the size of data structure, the final memory usage will be over the physical memory size of your computer. To tweak this, go to find the 'Constants.py' file inside MACS2 source code directory and modify the 'BUFFER_SIZE' to a small number such as 100 than install MACS2 again. Don't forget to remove the old installation entirely.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants