For generating ab initio predictions, for each spp. following files were gathered. If you have difficulty to install BRAKER, we suggest you to download the singularity container for this step at here.
- Genome sequence (
fasta
format) - RNA-Seq reads aligned to the respective genomes (coordinate sorted
bam
format) - Repeat library (
fasta
format)
- Repeat mask genomes using
Repeatmasker
- Use the hard-masked genomes for BRAKER predictions, using RNA-Seq as evidence.
- Tidy the
augustus.hints.gtf
file using genometools - Run
TE-sorter
to find BRAKER predictions (genes/transcripts), that contain TE.
Using the script runRpepeatmasker.sh
, RepeatMasker was run to find all the repeats in the genome. The obtained gff3
of repeats was then used to generate soft-masked and hard-masked genome sequences.
runRepeatmasker.sh <genome-file> <repeat-library>
bioawk -c fastx '{print ">"$name"\n"toupper($seq)}' original-genome.fna > genome.fasta
##above code not required - above code changes the original-genome from a softmasked genome to a non-masked genome
bedtools maskfasta -fi genome.fasta -bed repeat-masker.out.gff -fo genome.hardmasked.fasta
If you have difficulty to install BRAKER, we suggest you to download the singularity container for this step at here.
Required input to use the runBRAKER_singularity.sh
script: (1) path to Braker Singularity Image, (2) path to RNA-Seq BAM file and (3) path to hard masked genome.
The output files that are important are within the barker
folder. Following files are important:
augustus.hints.gtf
: final predictions (most important file)augustus.hints.codingseq
: transcriptsaugustus.hints.aa
: proteins
Using the program genometools
with the gtf_to_gff3
option to generate a clean version of a gff3 file for further filtering. Note: In my experience - somethings genometools
did not work. Review the Challenges ReadMe file.
module load genometools
gt gtf_to_gff3 -tidy -o augustus.hints-clean.gff3 augustus.hints.gtf
Create an CDS file using Cufflinks gffread from the augustus.hints.gtf
file.
module load cufflinks
gffread augustus.hints.gtf -g /path/to/the/genome_main.fasta -x augustus.hints.CDS.fasta
Use augustus.hints.CDS.fasta
for TE-sorter run as follows:
TEsorter.py -p 20 -db rexdb-plant augustus.hints.CDS.fasta
The result file augustus.hints.CDS.fasta.rexdb-plant.cls.tsv
provides the list of transcript ids that has potential TE elements in it. Review the script filter_TE.md to see how I filtered the TE sequences and generated a TE_Filtered GFF3 for the final compare & filter step