This repository contains several scripts for bioinformatics.
git clone https://github.com/sc-zhang/bioscripts.git
cd bin
chmod +x *
# Optional, add following line to your ~/.bash_profile
export PATH=/path/to/bioscripts/bin:$PATH
- approximate_cnv.py is a script for approximating CNV (Copy Number Variation) with read depth.
approximate_cnv.py -bam <bam_list_file> -g <genome_size> -l <read_length> -bed <bed_file> -o <out_file> [-t <thread_nums>]
Usage:
-bam: a list file, each line is the full path of a bam file
-g: the size of genome, integer
-l: the length of read, integer
-bed: bed file contain 4 columns: chromosome, start position, end position, gene name, seperate with tab
-o: result file
-t: threads, integer
- average_fpkm.py is a script for calculating average of fpkm values.
# Dependencies
# Python modules: numpy
average_fpkm.py <in_fpkm> <out_avg>
- blast2heatmap.py is a script for drawing heatmap with blast file of format 6.
# Dependencies
# Software: R, bedtools
# R modules: pheatmap
blast2heatmap.py <ref_fasta> <blast_file> <window_size> <out_name> <threshold_identify> <threshold_match>
- calc_gap_cnt.py is a script for calculating gap count of all sequences.
calc_gap_cnt.py <in_fa>
- calc_gene_ovlp_te.py is a script for calculating overlap ratio of genes with TE regions.
calc_gene_ovlp_te.py <gene_gff3> <TE_gffs> <ovlp_stat>
Usage:
ovlp_stat: is the output file.
- convert_collinearity_from_MCScanX_to_Circos.py is a script for converting collinearity file from MCScanX result to link file for Circos
convert_collinearity_from_MCScanX_to_Circos.py <collinearity_file> <gff_file> <out_file>
- convert_gbff_to_fasta.py is a script for converting NCBI GBFF file to fasta file.
convert_gbff_to_fasta.py <in_gbff> <out_fasta>
- convert_QTL_info.py is a script for converting QTL information of contig-level to chromosome-level with agp file.
convert_QTL_info.py <in_QTL> <in_agp> <out_QTL>
- convert_simple_for_circos.py is a script for converting JCVI simple file to link file for circos.
convert_simple_for_circos.py <in_simple> <in_gff3_files> <out_link>
- dup_dotplot.pl is a script for plotting dotplot with monoploid and polyploid.
dup_dotplot.pl -g reference_genome -r ref_id -q query_id -n number_of_dup -t threads
Usage:
ref_id: reference cds and bed name, like: Sb, Sb.cds and Sb.bed must exist
query_id: query cds and bed name, like: Os
number_of_dup: number of duplications
threads: default 1
- eval_filled_gaps.py is a script for evaluating status that gaps been filled
eval_filled_gaps.py <ref_fasta> <query_fasta> <result_file>
- extract_all_sv_from_nucmer_delta.py is a script for extracting SV from delta file generated by nucmer.
extract_all_sv_from_nucmer_delta.py <in_delta> <out_pre>
- extract_gene_from_gff.py is a script for extracting genes from gff3 file with gene id list and generating a bed file.
extract_gene_from_gff.py <in_list> <in_gff> <out_bed>
- extract_vcf.py is a script for extracting vcf with bed file
extract_vcf.py <in_vcf> <in_bed> <out_vcf>
- filter_cds.py is a script for removing invalided CDS sequences.
filter_cds.py <in_cds> <out_cds>
- find_gff_ovlp_regions.py is a script for getting overlap regions from gff3 file.
find_gff_ovlp_regions.py <in_gff3> <out_bed>
- get_chr_len.py is a script for calculating length of chromosomes in fasta file
get_chr_len.py <fasta_file> <output_file> <T/F chr only>
- get_genes_from_range.py is a script for getting genes with bed file.
get_genes_from_range.py <gff3_file> <bed_file> <output_file> <threshold>
- get_genes_region_from_gff.py is a script for getting gene regions from gff3 file.
get_genes_region_from_gff.py <gene_list> <in_gff> <out_bed>
- get_gff_with_list.py is a script for extracting gff3 file with gene IDs.
get_gff_with_list.py <in_gff> <in_list> <out_gff>
- get_seq_from_range.py is a script for extracting sequence fragments with bed file.
get_seq_from_range.py <in_fasta> <in_bed> <out_fasta>
- group_exon_and_intron.py is a script for classifying vcf positions to exon and intron.
group_exon_and_intron.py <input_gff> <input_vcf> <output_file>
- group_SNP_exon_and_intron.py is a script for classifying SNP positions to exon and intron.
group_SNP_exon_and_intron.py <input_gff> <input_snp> <output_file>
- merge_bed_regions.py is a script for merging bed files based on distance
merge_bed_regions.py <in_bed> <out_bed> <max_distance>
- modify_geno_with_snp_mummer.py is a script for modifying columns in geno file with snp result generated by show-snps of mummer
modify_geno_with_snp_mummer.py <in_geno> <in_snp> <col> <out_geno>
- nucmer_extract_all_sv.py is a script for running nucmer and extracting all SV.
# Dependencies
# Software: nucmer
nucmer_extract_all_sv.py <ref_fasta> <query_fasta> <out_pre> <threads>
- nucmer_statistics.py & nucmer_statistics_all_sv.py are scripts for running nucmer and generating statistics.
nucmer_statistics.py <ref_fasta> <query_fasta> <out_pre> <threads>
nucmer_statistics_all_sv.py <ref_fasta> <query_fasta> <out_pre> <threads>
- quick_extract_fastx.py is a script for extracting fasta or fastq file with list.
quick_extract_fastx.py <in_fastx|gz> <in_list> <out_fastx|gz>
- quick_mask_genome.py is a script for masking genome with bed file.
quick_mask_genome.py <in_fasta> <in_bed> <out_fasta> <threshold> <threads>
- remove_region_by_blast_result.py is a script for removing regions in chromosomes with blast results.
remove_region_by_blast_result.py <blast_results> <chr_len> <out_bed>
Usage:
<blast_results> is a list of blast files seperated with comma
- rename_ID.py is a script for sorting and renaming id with in_gff file, and renaming id in fasta files.
rename_ID.py <chr_prefix> <in_gff> <out_gff> <in_fasta> <out_fasta>
- SentieonSNP_filter.py is a script for filtering vcf result generated by Sentieon.
usage: SentieonSNP_filter.py [-h] -b BASE -v VALIDATION [-r REPEAT] -o OUTPUT [-m MISSING_RATE] [-d MIN_DISTANCE]
options:
-h, --help show this help message and exit
-b BASE, --base BASE Input vcf file as base
-v VALIDATION, --validation VALIDATION
Input vcf file as validation
-r REPEAT, --repeat REPEAT
Repeat regions file, gff format
-o OUTPUT, --output OUTPUT
Output vcf file based on base vcf file, compressed with gzip
-m MISSING_RATE, --missing_rate MISSING_RATE
Missing rate threshold, percentage, default: 40
-d MIN_DISTANCE, --min_distance MIN_DISTANCE
Minimum distance between two snp sites, default: 0
- SeqStat.py is a script for generating statistics with fasta|fastq|bam file.
SeqStat.py <in_file> [out_stat]
- SimContigs.py & SimCollapse.py are scripts for simulating collapsed contigs.
usage: SimContigs.py [-h] [--min MIN] [--max MAX] [-n N50] -i INPUT -o OUTPUT
options:
-h, --help show this help message and exit
--min MIN minimum length of contig, default: 15k, you can use both number or string end with k,m
--max MAX minimum length of contig, default: 5m, you can use both number or string end with k,m
-n N50, --n50 N50 size of N50, default: 500k, you can use both number or string end with k,m
-i INPUT, --input INPUT
origin fasta file of genome
-o OUTPUT, --output OUTPUT
filename of simulated data
usage: SimCollapse.py [-h] -a A_CONTIGS -b B_CONTIGS -p PREFIX -o OUTPUT -s BLAST [-c COLLAPSE]
options:
-h, --help show this help message and exit
-a A_CONTIGS, --a_contigs A_CONTIGS
first fasta file contain contigs generated by SimContigs.py
-b B_CONTIGS, --b_contigs B_CONTIGS
second fasta file contain contigs generated by SimContigs.py
-p PREFIX, --prefix PREFIX
prefix of contig file a and contig file b, divided by comma, like: HA, HB
-o OUTPUT, --output OUTPUT
filename of simulated data
-s BLAST, --blast BLAST
blast file with format 6, must use first file of input as query and second file as database
-c COLLAPSE, --collapse COLLAPSE
persentage of collapse region size, like 5 means 5%, default: 10
- simple_ANGSD.py & simple_ANGSD_without_errorCorrect.py are script for running ANGSD.
simple_ANGSD.py -l <species.list> -anc <outgroup.fasta> -r <region> [-out <out_group_name> -p <bam_path> -ref <ref.fasta>]
simple_ANGSD_without_errorCorrect.py -l <species.list> -r <region> [-out <out_group_name> -p <bam_path>]
Notice:
-p: path of bam files, default is current path
-out: name of outgroup, default is "Outgroup"
- simple_JBrowser.py is a script for generating file for JBrowser
# etc/SimpleJBrowser.conf is a template config file for simple_JBrowser.py
simple_JBrowser.py -f <fasta_file> [--gff <gff_file> --bed <bed_file> --bam <bam_file> --bw <bigwig_file> --conf <config_file>]
- SimSID.py is a script for simulating SNP, Insertions and Deletions.
usage: SimSID.py [-h] [-s SNP] [-i INSERTION] [--insert_length INSERT_LENGTH] [-d DELETION] [--delete_length DELETE_LENGTH] [--random_length] [-v] -r REF -o OUT
options:
-h, --help show this help message and exit
-s SNP, --snp SNP snp ratio of whole genome, percentage, default: 0.01
-i INSERTION, --insertion INSERTION
insertion ratio of whole genome, percentage, default: 0.01
--insert_length INSERT_LENGTH
max length of insertion, default: 10
-d DELETION, --deletion DELETION
delection ratio of whole genome, percentage, default: 0.01
--delete_length DELETE_LENGTH
max length of deletion, default: 10
--random_length use this argument for generate random length of indels
-v, --verbose print detail information
-r REF, --ref REF origin fasta file of genome
-o OUT, --out OUT prefix of simulated data
- split_cmd_with_parts.py is a script for splitting cmd file.
split_cmd_with_parts.py <in_cmd_file> <num_parts> <out_str> <threads>
- split_ctg_with_agp.py is a script for splitting contig fasta file into chromosome groups with agp file.
split_ctg_with_agp.py <in_fa> <in_agp> <out_dir>
- split_fasta_by_chr.py is a script for splitting fasta into several files contain single chromosome.
split_fasta_by_chr.py <in_fasta> <out_dir>
- split_fasta_by_count.py is a script for splitting fasta to several files with file size or sequence counts.
split_fasta_by_count.py <in_fasta> <S/F> <count> <out_dir>
- split_fasta_by_id.py is a script for splitting fasta with id.
split_fasta_by_id.py <in_fasta> <out_dir>
- StatAgp.py & StatAgpDetail.py are scripts for generating statistic with agp file.
StatAgp.py <in_agp>
StatAgpDetail.py <in_agp> <out_csv>
- subVCF.py is a script for extracting vcf file with list file, default missing rate 0.4.
subVCF.py <in_vcf> <in_list> <out_vcf> [<missing_rate>]
- transfer_gff3_with_agp.py is a script for transferring positions with old agp and new agp file.
transfer_gff3_with_agp.py <in_gff3> <in_old_agp> <in_new_agp> <out_gff3>
- eval_synteny.py is a script for evaluating the assembly consistency between query genome and reference genome by mapping cds of reference genome to query genome and reference genome with gmap and extract bed files with jcvi, be sure that the query bed file only contain the chromosomes and/or contigs which you want evalute.
usage: eval_synteny.py [-h] -r REF -q QRY -p PAIR
options:
-h, --help show this help message and exit
-r REF, --ref REF ref.bed
-q QRY, --qry QRY qry.bed
- get_seq_with_bed.py is a script for extracting sequences from fasta file with bed file, the bed file can contain 4 or 5 fields: [seq_id, start_pos, end_pos, out_id] or [seq_id, start_pos, end_pos, direct, out_id]
Usage: python get_seq_with_bed.py <in_fa> <in_bed> <out_fa>
- convert_anchorwave.py is a script for convert anchorwave maf file to a table file, which contains 7 columns: "Ref id, start position, end position, query id, start position, end position, variant type"
usage: convert_anchorwave.py [-h] -i INPUT -o OUTPUT
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input maf file
-o OUTPUT, --output OUTPUT
Out put file
- extract_fasta_with_bed.py is a script for extracting seq with bed file contain 5 colunms: "ID, start, end, direction, id", positions should be 1-based
Usage: python bin/extract_fasta_with_bed.py <in_fa> <in_bed> <out_fa>
Notice: bed should be 5 columns: "ID, start, end, direction, id", positions should be 1-based
- convert_chr_to_ctg_with_agp.py is a script for converting chromosomes to contigs with AGP file.
Usage: python ./bin/convert_chr_to_ctg_with_agp.py <in_fa> <in_agp> <out_fa>
- bam_cov.py is a script for calculating genome coverage ratio from bam file.
usage: bam_cov.py [-h] -b BAM -o OUTPUT [-t THREADS]
options:
-h, --help show this help message and exit
-b BAM, --bam BAM Input bam file, must be indexed
-o OUTPUT, --output OUTPUT
Output statistic
-t THREADS, --threads THREADS
Threads, default=10
- sort_gff3.py is a script for sorting genes with chromosomes and positions, and generating new IDs.
Usage: python ./bin/sort_gff3.py <chr_prefix> <in_gff3> <out_gff3>
Notice: sort and rename id with in_gff by coordinate, the chromosome ID should be like: Chr01 for mono assembly, Chr01A for phased assembly.
Example: python ./bin/sort_gff3.py CB5 in.gff3 out.gff3