1000 Genomes CEU Trio Analysis
The BAM files used in this analysis are available from:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/
A copy of the calls used in our Bioinformatics paper is on our FTP site. The comparison table in the paper was produced with RetroSeq v1.32.
Reference genome: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/
The Alu and L1 BED files are derived directly from RepeatMasker. A copy of the BED files used in the analysis is here: ftp://ftp-mouse.sanger.ac.uk/other/tk2/RetroSeq/hg19/
We also used Alu and L1 sequence files to increase the sensitivity of the discovery stage. These were derived directly from Repbase. A copy of the files used is here: ftp://ftp-mouse.sanger.ac.uk/other/tk2/RetroSeq/hg19/hg19_probes.tgz
The command lines used for the discovery stage were:
retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align
retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align
retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align
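The three discovery commands differ only in the sample ID, so they can be generated from a list of samples. The following is a minimal sketch, not part of the original analysis: the function name discover_cmd is hypothetical, and the script only prints the command lines (pipe its output to sh to run them, assuming retroseq.pl is on your PATH).

```shell
#!/usr/bin/env bash
# Print the RetroSeq discovery command for each member of the CEU trio.
# discover_cmd is a hypothetical helper; flags and file names are copied
# from the command lines above.

samples="NA12878 NA12891 NA12892"

discover_cmd() {
    # $1 = sample ID
    bam="CEUTrio.HiSeq.WGS.b37_decoy.$1.clean.dedup.recal.bam"
    echo "retroseq.pl -discover -bam $bam -output $bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align"
}

for s in $samples; do
    discover_cmd "$s"
done
```

Printing rather than executing makes it easy to inspect the commands first, or to hand them to a job scheduler.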
The -refTEs input file should be in the format <TE_name> <path to BED file>, tab-separated, with one TE type per line:
Alu /home/me/data/Alu.bed
L1 /home/me/data/L1.bed
The -eref input file should be in the format <TE_name> <path to FASTA file>, tab-separated, with one TE type per line:
Alu /home/me/data/Alu.fasta
L1 /home/me/data/L1.fasta
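Both input files can be written with printf so that the separator is a real tab rather than spaces. This is a sketch under assumptions: the helper name write_te_table is hypothetical, /home/me/data is the placeholder directory from the examples above, and the tab separator follows our reading of the RetroSeq usage text.

```shell
#!/usr/bin/env bash
# write_te_table (hypothetical helper): write one "<TE name><TAB><path>"
# line per TE type.
#   $1 = output file, $2 = directory holding the files, $3 = file extension
write_te_table() {
    : > "$1"                       # create or truncate the output file
    for te in Alu L1; do
        printf '%s\t%s/%s.%s\n' "$te" "$2" "$te" "$3" >> "$1"
    done
}

# File names match the -refTEs and -eref arguments used above;
# /home/me/data is a placeholder path.
write_te_table ref_types.tab /home/me/data bed
write_te_table probes.tab /home/me/data fasta
```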
The command lines used for the calling phase were:
retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400
retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400
retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400
The calls were then filtered in two ways to produce the final callsets.
First, we removed calls that lie very close to reference-annotated repeat elements. This was done using the bedtools 'window' command:
Alu
bedtools window -b Alu.bed -a CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.Alu.vcf -v -w 100 > NA12878.ref-filtered.Alu.vcf
L1
bedtools intersect -b Alu.bed -a CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.L1.rm_ref.vcf -v > NA12878.ref_filtered_alu.L1.vcf
bedtools window -b L1_HS.bed -a NA12878.ref_filtered_alu.L1.vcf -v -w 200 > NA12878.ref_filtered_alu.ref_filtered_L1.L1.vcf
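To make the intent of the -v -w options concrete, here is a plain awk illustration of the same idea: keep only VCF records that have no BED interval within a given number of bases. It is not what was run for the paper and is not a replacement for bedtools (it compares every call against every interval on the same chromosome, and it only approximates how bedtools derives coordinates from a VCF record). The function name far_from_bed is hypothetical, and the BED file is assumed to be non-empty and free of header lines.

```shell
#!/usr/bin/env bash
# far_from_bed (hypothetical helper): print VCF records that have no BED
# interval within W bp on the same chromosome.
#   $1 = BED file, $2 = VCF file, $3 = window size W in bp
far_from_bed() {
    awk -F'\t' -v w="$3" '
    # First file (BED): store every interval, grouped by chromosome.
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    # Second file (VCF): pass header lines through unchanged.
    /^#/ { print; next }
    {
        # A VCF position p is the 0-based half-open interval [p-1, p);
        # widen it by w on both sides before testing for overlap.
        lo = $2 - 1 - w
        hi = $2 + w
        for (i = 1; i <= n[$1]; i++)
            if (s[$1, i] < hi && e[$1, i] > lo) next   # too close: drop
        print
    }' "$1" "$2"
}
```

For example, far_from_bed Alu.bed calls.vcf 100 would keep the calls lying more than roughly 100 bp from any interval in Alu.bed (calls.vcf is a placeholder name).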
Finally, we selected calls from the VCF file with the following INFO tags:
FL=6 & GQ>=28
FL=7 & GQ>=20
FL=8 & GQ>=20
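The selection rules above can be applied with a short awk filter. This is a sketch, not the script used for the paper: the function name select_calls is hypothetical, and it assumes FL and GQ appear either as key=value entries in the INFO column (column 8) or as per-sample fields named in the FORMAT column (column 9) with values in column 10. Records lacking either tag are dropped.

```shell
#!/usr/bin/env bash
# select_calls (hypothetical helper): keep VCF header lines plus records
# with FL=6 and GQ>=28, or FL=7 or FL=8 and GQ>=20.
# Reads the files given as arguments, or stdin if none are given.
select_calls() {
    awk -F'\t' '
    # Look up a tag in INFO first, then in the FORMAT/sample columns.
    # Returns an empty string when the tag is absent.
    function tagval(tag,    n, i, kv, keys, vals) {
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++)
            if (index(kv[i], tag "=") == 1)
                return substr(kv[i], length(tag) + 2)
        n = split($9, keys, ":")
        split($10, vals, ":")
        for (i = 1; i <= n; i++)
            if (keys[i] == tag)
                return vals[i]
        return ""
    }
    /^#/ { print; next }
    {
        fl = tagval("FL")
        gq = tagval("GQ")
        if (fl == "" || gq == "") next
        fl += 0
        gq += 0
        if ((fl == 6 && gq >= 28) || ((fl == 7 || fl == 8) && gq >= 20))
            print
    }' "$@"
}
```

A possible use would be select_calls NA12878.ref-filtered.Alu.vcf > NA12878.final.Alu.vcf, where the output name is a placeholder.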