02 Get started
To get started, you need a Python analysis file in which you define your input; example scripts can be found in the main folder. Further settings for the alignment-building process and the filtering options are made in the configuration file (see 03 Settings options). For more examples, see 04 advanced analyses.
The standard setup described here runs an analysis that adds sequences to the input if they are similar and long enough, as long as they are not subsequences of an already existing sequence, until the taxon reaches the threshold. The sequences retrieved from the BLAST search or during the updating are filtered before a new BLAST round begins.
There is an example file in example_analysis.py; it comes with a tiny sample dataset in tests/data/tiny_example. The standard definition consists of the following input options:
- seqaln: give the path to your single sequence or alignment file, must be a single gene alignment
- mattype: file format of your alignment - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
- trfn: optional: give the path to the file containing the corresponding phylogeny, all tips must be represented in the alignment file as well.
- schema_trf: file format of your phylogeny file - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
- id_to_spn: path to a comma-delimited file where tip labels correspond to species names; an example file can be found in tests/data/tiny_test_example/test_nicespl.csv. Species names must be written with '_' rather than whitespace; this also applies to species epithets with a '-' in their name.
- workdir: path to your working directory, the folder where intermediate and result files shall be stored.
- configfi: path to your configuration file, configuration options are explained in "03 Settings options".
Note: Specified paths must either start from your root directory (e.g. /home/USER/PhylUp/path/to/file) or be relative to the PhylUp main folder (./path/to/file).
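As a rough illustration of the input options above, the sketch below collects them in a plain dictionary and checks them against the rules stated on this page. It does not call PhylUp itself: the dictionary layout, the placeholder paths (everything except the id_to_spn example file) and the helpers check_inputs, is_accepted_path and normalize_species_name are hypothetical names introduced here, and replacing '-' with '_' is an assumption about how the species-name rule is meant.

```python
import os

# File formats listed above for mattype and schema_trf.
SUPPORTED_FORMATS = {"fasta", "newick", "nexus", "nexml", "phylip"}

# Input definitions. All paths except id_to_spn are placeholders.
settings = {
    "seqaln": "./path/to/alignment.fasta",
    "mattype": "fasta",
    "trfn": "./path/to/tree.tre",  # optional, may be None
    "schema_trf": "newick",
    "id_to_spn": "./tests/data/tiny_test_example/test_nicespl.csv",
    "workdir": "./path/to/workdir",
    "configfi": "./path/to/config.cfg",
}


def is_accepted_path(path):
    """True for absolute paths or paths written relative as './...'."""
    return os.path.isabs(path) or path.startswith("./")


def check_inputs(opts):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if opts["mattype"] not in SUPPORTED_FORMATS:
        problems.append("mattype: unsupported format %r" % opts["mattype"])
    # schema_trf only matters when a phylogeny is actually supplied.
    if opts.get("trfn") and opts["schema_trf"] not in SUPPORTED_FORMATS:
        problems.append("schema_trf: unsupported format %r" % opts["schema_trf"])
    for key in ("seqaln", "trfn", "id_to_spn", "workdir", "configfi"):
        value = opts.get(key)
        if value is None:  # trfn is optional
            continue
        if not is_accepted_path(value):
            problems.append("%s: path must be absolute or start with './'" % key)
    return problems


def normalize_species_name(name):
    """Replace whitespace and '-' with '_' (assumed reading of the rule)."""
    return "_".join(name.replace("-", " ").split())


if __name__ == "__main__":
    print(check_inputs(settings))
    print(normalize_species_name("Genus epithet-part"))
```

A check like this only catches formatting mistakes before a long run starts; it does not verify that the files exist or that all tree tips are present in the alignment.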
Besides the standard definition, there are more input options, see below. Currently supported are:
- ignore_acc_list: a list of sequences that shall not be added; if one of them is part of the input, it will be removed. This needs to be formatted as a Python list containing the GenBank identifiers (e.g. [accession number, accession number]).
- status_end: tells the program how many rounds of BLAST searches to run. If it is set to 1, the input sequences are blasted once and the sequences found are added to the alignment. If it is set to 2, the newly retrieved sequences from the initial BLAST search are blasted as well and those results are also added to the alignment; in total, the input plus the first round of newly found sequences is blasted. If it is set to 3, one more round is added, and so on.
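To make the interplay of status_end and ignore_acc_list more concrete, here is a toy simulation of the rounds described above. It is not PhylUp's implementation: run_rounds, toy_blast and the single-letter accession labels are made up for illustration, and the similarity, length and per-taxon filtering steps are left out entirely.

```python
def run_rounds(input_seqs, blast, status_end, ignore_acc_list=()):
    """Simulate BLAST rounds: round 1 blasts the input, each later
    round blasts only the sequences added in the previous round."""
    ignored = set(ignore_acc_list)
    # Ignored accessions are removed even if they are part of the input.
    alignment = [acc for acc in input_seqs if acc not in ignored]
    to_blast = list(alignment)
    for _ in range(status_end):
        new = []
        for query in to_blast:
            for hit in blast(query):
                if hit in ignored or hit in alignment or hit in new:
                    continue
                new.append(hit)
        alignment.extend(new)
        to_blast = new
        if not new:  # nothing left to blast in a further round
            break
    return alignment


# Hypothetical search results: query accession -> accessions found.
TOY_HITS = {"A": ["B", "C"], "B": ["D"], "C": ["X"], "D": ["E"]}


def toy_blast(query):
    """Stand-in for a real BLAST search, using the table above."""
    return TOY_HITS.get(query, [])


if __name__ == "__main__":
    for rounds in (1, 2, 3):
        print(rounds, run_rounds(["A"], toy_blast, rounds, ignore_acc_list=["X"]))
```

With status_end set to 1 only the input "A" is blasted; with 2 the hits "B" and "C" are blasted as well, and "X" is never added because it is on the ignore list.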
Start the analyses from your PhylUp main folder:
python3 ./path/to/file/analysis-file.py
Blasting unfortunately takes quite some time, especially for larger alignments (for speed-up options, see the configuration options). One option is to run the analyses on a cluster.
After the single-gene PhylUp runs have been updated, the data can be combined using phylogenetic concatenation; see for example example_concat.py.
During a PhylUp run, several files are written out. Here is a short introduction to what they are:
- all_new_seqs.updated: csv file with all information available about the newly added sequences
- fulltree.raxml.: all the files generated by RAxML-NG if the tree updating was enabled
- logfile: short summary of how many sequences were added/filtered during a PhylUp run
- table.updated: all sequences that were considered for addition, including those that were later removed by the filtering steps. It does not include sequences that were not added because they are not part of the mrca or because they were too short; for those, see wrong_mrca.csv and wrong_seq_length.csv
- updt_aln.fasta: updated alignment; there is also a relabeled version with tipnames as a combination of taxon name and accession number
- updt_tre.tre: updated tree or unresolved tree if no phylogeny was provided as input; there is also a relabeled version with tipnames as a combination of taxon name and accession number
- orig_inputaln.fasta: data as supplied in the input
- orig_tre.tre: data as supplied in the input
- blast: folder that holds all blast results
- tmp: folder that holds taxonomic information, complete sequences and internally used files.
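After a run, a quick way to see which of the outputs listed above were produced is to scan the working directory. The sketch below assumes that all of these files and folders sit directly inside workdir, which this page does not state explicitly; summarize_workdir is a hypothetical helper, not part of PhylUp.

```python
import os

# Output names taken from the list above.
EXPECTED_OUTPUTS = [
    "all_new_seqs.updated",
    "logfile",
    "table.updated",
    "updt_aln.fasta",
    "updt_tre.tre",
    "orig_inputaln.fasta",
    "orig_tre.tre",
    "blast",
    "tmp",
]


def summarize_workdir(workdir):
    """Map each expected output name to True/False (present or not)."""
    present = set(os.listdir(workdir))
    report = {name: name in present for name in EXPECTED_OUTPUTS}
    # RAxML-NG files only exist if tree updating was enabled.
    report["fulltree.raxml.*"] = any(
        name.startswith("fulltree.raxml.") for name in present
    )
    return report


if __name__ == "__main__":
    # "./path/to/workdir" is a placeholder for your own working directory.
    workdir = "./path/to/workdir"
    if os.path.isdir(workdir):
        for name, found in summarize_workdir(workdir).items():
            print("%-22s %s" % (name, "found" if found else "missing"))
```

A missing fulltree.raxml.* entry is expected when tree updating is disabled, so it is reported separately from the other outputs.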