
Pipeline for scaffolding genome assemblies using haplotagging reads



scaffhtag v2.0

Pipeline for scaffolding genome assemblies using haplotagging reads.

Pipeline steps:

Scaffolding with scaffhtag:
  1 Barcode tags are extracted from the raw htag sequencing reads and appended
      to the read names for further processing
  2 The reads are mapped to the draft assembly using either BWA or SMALT
  3 Barcodes are sorted together with contigs and mapping coordinates
  4 A relation matrix is built to record the barcodes shared among contigs that may be linked
  5 The order and orientation of linked contigs are determined after nearest neighbours are found
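
Step 4 can be illustrated with a small sketch that counts how many barcodes each pair of contigs shares. This is not scaffhtag's own implementation (which lives in its C sources); the function name shared_barcodes and the two-column "barcode contig" input format are assumptions made for illustration only.

```shell
# shared_barcodes FILE
# FILE holds one "barcode contig" pair per line, separated by a single space
# (hypothetical format, not scaffhtag's internal one).
# Prints "contigA contigB n_shared_barcodes" for every contig pair that
# shares at least one barcode.
shared_barcodes() {
    # Sorting in the C locale keeps all lines of one barcode adjacent,
    # and -u drops repeated barcode/contig observations.
    LC_ALL=C sort -u "$1" | awk '
        NF < 2 { next }                          # skip blank or malformed lines
        $1 != bc { emit_pairs(); bc = $1; n = 0 } # new barcode: flush the old one
        { contigs[++n] = $2 }                    # collect contigs of this barcode
        END { emit_pairs(); for (p in count) print p, count[p] }
        function emit_pairs(   i, j, a, b, t) {
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++) {
                    a = contigs[i]; b = contigs[j]
                    if (a > b) { t = a; a = b; b = t }  # canonical pair order
                    count[a " " b]++
                }
        }
    ' | LC_ALL=C sort
}
```

Pairs whose count reaches a chosen threshold would then be candidates for linking, in the spirit of the minimum number of shared barcodes that scaffhtag takes as a parameter.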

Download and Compile:

Requirements for compiling: gcc 4.9.2 or later.

If you see this message:

cc1: error: unrecognised command line option ‘-std=c11’
make: *** [breakhtag.o] Error 1

you need a newer version of gcc. Point CC at it in the makefile, for example CC= /software/gcc-4.9.2/bin/gcc

$ git clone 
$ cd scaffhtag
$ bash

If everything compiled successfully, you should see the final message: "Congrats: installation successful!"

External packages

The genome aligners BWA and SMALT are downloaded and compiled by htag.

Run the pipelines

Prepare read files with barcode error correction and extraction

       $ /full/path/to/htag/src/scaff_read input.dat htag-reads_BC1.fastq.gz htag-reads_BC2.fastq.gz
       input.dat               - input text file listing the locations of the reads in cram files
       htag-reads_BC1.fastq.gz - output read file 1
       htag-reads_BC2.fastq.gz - output read file 2

       The input.dat file should list the cram files with their full paths, like this:
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#17.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#18.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#19.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#20.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#21.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#22.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#23.cram \
	/lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/43969#24.cram \ 
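
A list like this can be generated rather than typed by hand. The sketch below is only a suggestion: the function name make_input_dat is hypothetical, and it assumes that scaff_read expects one plain absolute path per line (that is, that the trailing backslashes in the listing above are not part of the file).

```shell
# make_input_dat CRAM_DIR OUT_FILE
# Writes the absolute path of every *.cram file in CRAM_DIR to OUT_FILE,
# one path per line. Hypothetical helper, not part of scaffhtag.
make_input_dat() {
    # Resolve CRAM_DIR to an absolute path so the list works from any directory.
    cram_dir=$(cd "$1" && pwd) || return 1
    for f in "$cram_dir"/*.cram; do
        # The test also guards against the unexpanded pattern when nothing matches.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$2"
}
```

For example, make_input_dat /path/to/crams input.dat would write the list to input.dat in the current directory.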

Run scaffhtag with paired reads htag-reads_BC1.fastq.gz htag-reads_BC2.fastq.gz:

       $ /full/path/to/htag/src/scaffhtag -nodes <nodes> -align <aligner> -score <score> \
   	 -matrix <matrix_size> -read-s1 <min_reads_s1> -read-s2 <min_reads_s2> \
	 -edge <edge_len> -link-s1 <n_links_s1> -link-s2 <n_links_s2> -block <block>  \
	 [ -mkdup Dupmarked.bam ] [ -plot barcode-length.png ] \
	 draft-assembly.fasta htag-reads_BC1.fastq.gz htag-reads_BC2.fastq.gz output_scaffolds.fasta

         nodes:        number of CPUs requested  [ default = 30 ]
         score:        averaged mapping score on each barcode fragment [ default = 20 ]
         aligner:      sequence aligner: bwa or smalt [ default = bwa ]
         matrix_size:  relation matrix size [ default = 2000 ]
         min_reads_s1: step 1: minimum number of reads per barcode [ default = 10 ]
         min_reads_s2: step 2: minimum number of reads per barcode [ default = 10 ]
         edge_len:     length of mapped reads to consider for scaffolding [ default = 50000 ]
         n_links_s1:   step 1: minimum number of shared barcodes [ default = 8 ]
         n_links_s2:   step 2: minimum number of shared barcodes [ default = 8 ]
         aggressive:   1 - aggressive mapping filtering on small PacBio/ONT contigs;
                       0 - no aggressive filtering, for short-read assemblies  [ default = 1 ]
         block:        length used to determine nearest neighbours [ default = 50000 ]
         plot:         output image file with barcode length distributions and coverage stats
         mkdup:        output bam file with duplicated reads removed
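
To make the defaults concrete, the sketch below prints (without running) a scaffhtag command line with the documented default values written out for the flags shown in the usage above. The function name scaffhtag_default_cmd and the SCAFFHTAG variable are hypothetical conveniences; only the flags and default values come from the parameter list.

```shell
# scaffhtag_default_cmd DRAFT READS_1 READS_2 OUTPUT
# Prints a scaffhtag command with the documented defaults spelled out.
# SCAFFHTAG is a hypothetical variable holding the path to the binary.
scaffhtag_default_cmd() {
    # printf reuses the '%s ' format for each argument, so every token
    # is followed by a single space.
    printf '%s ' "${SCAFFHTAG:-/full/path/to/htag/src/scaffhtag}" \
        -nodes 30 -align bwa -score 20 -matrix 2000 \
        -read-s1 10 -read-s2 10 -edge 50000 \
        -link-s1 8 -link-s2 8 -block 50000
    printf '%s %s %s %s\n' "$1" "$2" "$3" "$4"
}
```

Printing the command first makes it easy to review or adjust individual values before pasting it into a job script.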

Run scaffhtag with aligned and sorted bam file: aligned.bam

       $ /full/path/to/htag/src/scaffhtag -nodes <nodes> -plot barcode-length.png \
         -bam /lustre/scratch117/sciops/team117/hpag/zn1/aligned.bam \
         draft-assembly.fasta output_scaffolds.fasta

Run alignment with ema:

       $ /full/path/to/htag/src/scaff-bin/ema-align.csh <input_cram_file> \
         <Output_workdirectory> <bwa_index> <output_bam_file>

Instructions for Installation
   Tools needed 
     1. samtools version 1.15 or later 
     2. SamHaplotag  
     3. bwa version 0.7.12-r1044 or later
     4. ema 
 Use of bioconda for installation 
 Reference index 
 -- Say you have a reference genome assembly  
   cd /lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/bindex/
   bwa index Oak-chr.fasta
   samtools faidx Oak-chr.fasta
 -- Say you have cram file 43969#17.cram and Oak-chr.fasta index 
   /nfs/users/nfs_z/zn1/bin/ema-align.csh 43969#17.cram readsplit-17 \
       /lustre/scratch117/sciops/team117/hpag/zn1/project/HiC/QC/run-43969/oak1/bindex/Oak-chr.fasta \ 
 Your output file ema_final-17.bam will be in readsplit-17.  


