maize-genetics · lynnjo · Dec 15, 2023 · Dec 6, 2023 · Dec 7, 2023 · Dec 7, 2023
diff --git a/docs/build_and_load.md b/docs/build_and_load.md
@@ -31,10 +31,15 @@ In this document, we will discuss the steps needed to:
         --gff anchors.gff \
         --reference-file /my/ref.fasta \
         --assemblies assemblies_list.txt \
-        --total-threads 20 \
-        --in-parallel 4 \
         -o /path/for/generated_files
     ```
+* Update FASTA headers with sample information:
+    ```shell
+    phg annotate-fastas \
+        --keyfile /path/to/keyfile \
+        --output-dir /path/to/annotated/fastas \
+        --threads 10
+    ```
 * Compress FASTA files
     ```shell
     phg agc-compress \
@@ -325,6 +330,12 @@ This command uses several parameters:
   `LineB`. Since these are located in a subdirectory called `data`
   relative to my working directory, I will also add that to the path.
 
+* `-o` - The name of the directory for the alignment outputs.
+
+
+In addition to the above parameters, there are two optional parameters. When values are not specified for these parameters,
+default values are calculated by the software based on the system processor and memory configuration.
+
 * `--total-threads` - How many threads would you like to allocate for
   the alignment step? More information about this step and the 
   `--in-parallel` step can be found in the following **Details - 
@@ -333,7 +344,7 @@ This command uses several parameters:
   More information about this step and the `--total-threads` step can
   be found in the following **Details - threads and parallelization** 
   section.
-* `-o` - The name of the directory for the alignment outputs. 
+
 
 > [!WARNING]
 > The directory that you specify in the output (`-o`) section must
@@ -407,7 +418,52 @@ or review the following code blocks:
 > prior `-R` and `-Q` parameters.
 
 
-[//]: # (#### Details - threads and parallelization)
+#### Details - threads and parallelization
+Aligning with anchorwave is memory intensive and can be slow.  Processing speed may be increased by
+using multiple threads for each alignment, and/or by running multiple alignments in parallel.  The amount of memory
+each thread takes is dependent on the processor type.  The table below shows the memory usage for a single
+alignment based on processor type:
+
+| Processor | peak memory (GB) | wall time |
+|-----------|------------------|-----------|
+| SSE2      | 20.1             | 26:47:17  |
+| SSE4.1    | 20.6             | 24:05:07  |
+| AVX2      | 20.1             | 21:40:00  |
+| AVX512    | 20.1             | 18:31:39  |
+| ARM       | 34.2             | 18:08:57  |
+
+The `--total-threads` parameter indicates the total number of threads available for system use.  The `--in-parallel`
+parameter controls the number of alignments run in parallel.  When these values are not specified, the software will
+compute the optimal values based on the system processor and memory configuration. 
+
+The number of threads that may be run in parallel is limited by the amount of memory available.  The
+system is queried for memory and processor information.  The number of threads that may be run in parallel
+is determined by "system memory" / "peak memory" from the table above.  To generalize the calculation, we divide
+memory available (GiB) by 21 and round down to the nearest integer.
+
+For example, if the system has 512 GB of memory, 80 processors and 5 assemblies that need aligning,
+the maximum number of threads that could be run in parallel is 24 (512/21).  The number of potential parallel
+alignments with the threads allocated for each is shown in the table below
+
+
+| Alignments in parallel | Threads per alignment |
+|------------------------|-----------------------|
+| 5                      | 4                     |
+| 4                      | 6                     |
+| 3                      | 8                     |
+| 2                      | 12                    |
+| 1                      | 24                    |
+
+The software will select a pairing that maximizes the number of alignments run in parallel while utilizing multiple threads, 
+opting for a value in the middle.  In the case above with 5 assemblies and a possibility of 24 concurrent threads,
+the system will choose to run 3 alignments in parallel with 8 threads each.  The total number of threads used
+will be 24 (3 * 8).
+
+User defined values for in-parallel and total-threads are considered along with the number of assemblies to
+align and system capacities, when determining the anchorwave setup.
+
+
+
 
 [//]: # ()
 [//]: # (***WIP*** - revisit once we can agree on naming and parameter conventions)
@@ -442,64 +498,14 @@ or review the following code blocks:
 
 [//]: # (| ARM       | 34.2             | 18:08:57  |)
 
-
-### Compress FASTA files
-For optimal storage of sequence information, we can convert our 
-"plain-text" FASTA files into a more compact representation using 
-compression. For PHGv2, we can use a command called `agc-compress`, 
-which is a wrapper for the
-[Assembled Genomes Compressor](https://github.com/refresh-bio/agc) 
-(AGC). AGC provides performant and efficient compression ratios for 
-our assembly genomes. Like AnchorWave, AGC is also installed during
-the Conda environment setup phase, so there is no need to install 
-this manually.
-
-To run the compression step, we can call the `align-assemblies` 
-command:
-
-```shell
-./phg agc-compress \
-    --db-path vcf_dbs \
-    --fasta-list data/assemblies_list.txt \
-    --reference-file data/Ref.fa
-```
-
-This command takes in 3 parameters:
-* `--db-path` - path to directory storing the TileDB instances. The
-  AGC compressed genomes will be placed here on completion.
-
-> [!NOTE]
-> The directory specified here should be the same directory used to 
-> initialize the TileDB instances in the database initialization 
-> (`initdb`) step.
-
-* `--fasta-list` - List of assembly FASTA genomes to compress.
-
-> [!NOTE]
-> The list specified in `--fasta-list` _can_ be the same list used
-> in the alignment (`align-assemblies`) step, but make sure header
-> information in each FASTA contains sample IDs. See the 
-> [**"Important information regarding `agc` compression"**](#warning-important-information-regarding-agc-compression-warning)
-> section for further information and how to remedy this.
-
-* `--reference-file` - Reference FASTA genome.
-
-After compression is finished, we can navigate to the directory
-containing the TileDB instances. In my case, this would be the
-subdirectory, `vcf_dbs`. Here, you will see a new file created:
-`assemblies.agc`. This is the compressed AGC file containing our 
-assemblies. This file will be used later to query for haplotype
-sequence regions and composite genome creation.
-
-#### :warning: Important information regarding `agc` compression :warning:
-
+### Annotate Fastas
 As of the current date of this document, the return methods of
-AGC will not keep track of sample IDs when returning 
-sequence information from the compressed file **unless you explicitly 
-state the sample information in the header lines of the FASTA files 
-you wish to compress for the PHGv2 databases**. To explain this 
-further, let's imagine that we have two FASTA files: `LineA.fa` and 
-`LineB.fa`. These files contain information for only chromosome 1 and 
+AGC will not keep track of sample IDs when returning
+sequence information from the compressed file **unless you explicitly
+state the sample information in the header lines of the FASTA files
+you wish to compress for the PHGv2 databases**. To explain this
+further, let's imagine that we have two FASTA files: `LineA.fa` and
+`LineB.fa`. These files contain information for only chromosome 1 and
 have a simple header that denotes this sequence name (e.g. `chr1`):
 
 ```
@@ -512,12 +518,12 @@ $ head LineB.fa
 ATGCGTTCGCCTTCCG
 ```
 
-After we compress this information using the `agc-compress` command, 
+After we compress this information using the `agc-compress` command,
 we will have a file called `assemblies.agc`. PHGv2 leverages this
 compressed sequence file when creating VCF data (_see
-the [**"Create VCF files"**](#create-vcf-files) section for further 
+the [**"Create VCF files"**](#create-vcf-files) section for further
 details_). The issue arrives when we query the compressed data using
-[AGC's `getctg`](https://github.com/refresh-bio/agc#extract-contigs-from-the-archive) 
+[AGC's `getctg`](https://github.com/refresh-bio/agc#extract-contigs-from-the-archive)
 command. When we query this hypothetical archive, we will get the
 following information:
 
@@ -531,7 +537,7 @@ ATGCGTACGCGCACCG
 ATGCGTTCGCCTTCCG
 ```
 
-As you can see from the above hypothetical output, we now have no 
+As you can see from the above hypothetical output, we now have no
 means to efficiently track where each sequence is coming from due
 to the lack of header information from our prior FASTA files. We
 can remedy this by adding sample information to the headers:
@@ -559,7 +565,7 @@ ATGCGTACGCGCACCG
 ATGCGTTCGCCTTCCG
 ```
 
-While we can either manually modify the header lines of our FASTA 
+While we can either manually modify the header lines of our FASTA
 file, this can become tedious and prone to a new set of downstream
 errors. To automate this, PHGv2 provides an optional command to
 append sample information to the headers of each FASTA file called
@@ -574,35 +580,37 @@ phg annotate-fastas \
 
 This command takes 3 parameters:
 
-* `--keyfile` - A [tab-delimited](https://en.wikipedia.org/wiki/Tab-separated_values) 
+* `--keyfile` - A [tab-delimited](https://en.wikipedia.org/wiki/Tab-separated_values)
   keyfile containing two columns:
 
-  | Column | Value                                                                                                                                                                                                                      |
-  |--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-  | 1      | Path to FASTA file you would like annotated (this is similar to the text files used to point to the FASTA file paths in the [`agc-compress`](#compress-fasta-files) and [`align-assemblies`](#align-assemblies) commands). |
-* | 2      | Name of the sample that will be appended to each header line                                                                                                                                                               |
+| Column | Value                                                                                                                                                                                                                      |
+|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| 1      | Path to FASTA file you would like annotated (this is similar to the text files used to point to the FASTA file paths in the [`agc-compress`](#compress-fasta-files) and [`align-assemblies`](#align-assemblies) commands). |
+| 2      | Name of the sample that will be appended to each header line                                                                                                                                                               |
+
   + My example `annotation_keyfile.txt` file would look like this:
+
   ```
   data/LineA.fa   LineA
   data/LineB.fa   LineB
   ```
 * `--threads` - Optional number of threads to annotate multiple
   FASTA files in parallel. _Defaults to `1`_.
-* `-o` - Output directory for the newly annotated FASTA files 
+* `-o` - Output directory for the newly annotated FASTA files
 
 > [!WARNING]
 > This step must be performed before the `agc-compress` step.
 
 > [!WARNING]
-> Sample IDs in the keyfile must match with what is found in the 
+> Sample IDs in the keyfile must match with what is found in the
 > TileDB instances.
 
 > [!NOTE]
 > FASTA files can be either uncompressed or compressed. If
 > compressed, the extension should be `*.gz`.
 
 Once finished, this command will produce FASTA files with the name
-of the sample from the keyfile appended to each header line. For 
+of the sample from the keyfile appended to each header line. For
 example, in our hypothetical FASTA files, our headers go from this:
 
 ```
@@ -635,6 +643,60 @@ ATGCGTACGCGCACCG
 >ATGCGTACGCGCACCG
 >```
 
+
+
+
+
+### Compress FASTA files
+For optimal storage of sequence information, we can convert our 
+"plain-text" FASTA files into a more compact representation using 
+compression. For PHGv2, we can use a command called `agc-compress`, 
+which is a wrapper for the
+[Assembled Genomes Compressor](https://github.com/refresh-bio/agc) 
+(AGC). AGC provides performant and efficient compression ratios for 
+our assembly genomes. Like AnchorWave, AGC is also installed during
+the Conda environment setup phase, so there is no need to install 
+this manually.
+
+To run the compression step, we can call the `align-assemblies` 
+command:
+
+```shell
+./phg agc-compress \
+    --db-path vcf_dbs \
+    --fasta-list data/assemblies_list.txt \
+    --reference-file data/Ref.fa
+```
+
+This command takes in 3 parameters:
+* `--db-path` - path to directory storing the TileDB instances. The
+  AGC compressed genomes will be placed here on completion.
+
+> [!NOTE]
+> The directory specified here should be the same directory used to 
+> initialize the TileDB instances in the database initialization 
+> (`initdb`) step.
+
+* `--fasta-list` - List of assembly FASTA genomes to compress.
+
+> [!NOTE]
+> The list specified in `--fasta-list` _can_ be the same list used
+> in the alignment (`align-assemblies`) step, but make sure header
+> information in each FASTA contains sample IDs. See the 
+> [**"Important information regarding `agc` compression"**](#warning-important-information-regarding-agc-compression-warning)
+> section for further information and how to remedy this.
+
+* `--reference-file` - Reference FASTA genome.
+
+After compression is finished, we can navigate to the directory
+containing the TileDB instances. In my case, this would be the
+subdirectory, `vcf_dbs`. Here, you will see a new file created:
+`assemblies.agc`. This is the compressed AGC file containing our 
+assemblies. This file will be used later to query for haplotype
+sequence regions and composite genome creation.
+
+
+
 ### Create VCF files
 Now that we have (1) created alignments of our assemblies against a
 single reference genome and (2) created compressed representations