[#60] Update documentation for the new training step 3

I've documented that this step is available only for Training, not for Household Training at the moment. We can easily add this functionality to Household Training as well if it would be useful.
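
For reference, here is a minimal sketch of enabling the new step, assuming the config file is written in TOML. Only the `training.feature_importances` attribute comes from this change; the rest of the `training` section is omitted.

```toml
# Hypothetical excerpt of a config file; the other training settings are omitted.
[training]
# Enables training step 3, which is skipped by default.
feature_importances = true
```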
ipums · Oct 23, 2023 · d793638
1 parent 47347b4

Showing 7 changed files with 35 additions and 10 deletions.
diff --git a/docs/_sources/config.md.txt b/docs/_sources/config.md.txt
@@ -688,7 +688,9 @@ splits = [-1,0,6,11,9999]
   * `use_training_data_features` -- Type: `boolean`. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to `true`, or training features will not be able to be generated, giving null column errors.  For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to `true` or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data.  If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to `false`, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven't changed, you could set it to `true` to save a small amount of processing time.
   * `output_suspicious_TD` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data.  Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.
   * `split_by_id_a` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a "A304BT" has three potential matches in the training data, one each to histid_b "B200", "C201", and "D425", all of those potential matches would either end up in the "train" split or the "test" split when evaluating the model performance.
-  * `feature_importances` -- Type: `boolean`. Optional, and currently not functional.  Whether to record feature importances for the training features when training or evaluating an ML model.
+  * `feature_importances` -- Type: `boolean`. Optional.  Whether to record
+    feature importances or coefficients for the training features when training
+    the ML model. Set this to true to enable training step 3.
 
 
 ```

diff --git a/docs/_sources/link_tasks.md.txt b/docs/_sources/link_tasks.md.txt
@@ -30,15 +30,21 @@ as they are read in.
 Train a machine learning model to use for classification of potential links. This
 requires training data, which is read in in the first step. Comparison features
 are generated for the training data, and then the model is trained on the data
-and saved for use in the Matching task.
+and saved for use in the Matching task. The last step optionally saves some metadata
+like feature importances or coefficients for the model to help with introspection.
 
 ### Task steps
 
-The steps in each of these tasks are the same:
+The first three steps in each of these tasks are the same:
 * Step 0: Ingest the training data from a CSV file.
 * Step 1: Create comparison features.
 * Step 2: Train and save the model.
 
+The last step is available only for Training, not for Household Training.
+* Step 3: Save the coefficients or feature importances of the model for inspection.
+  This step is skipped by default. To enable it, set the `training.feature_importances`
+  config attribute to true in your config file.
+
 ### Related Configuration Sections
 
 * The [`training`](config.html#training-and-models) section is the most important

diff --git a/docs/config.html b/docs/config.html
@@ -760,7 +760,9 @@ <h2>Training and <a class="reference internal" href="models.html"><span class="d
 <li><p><code class="docutils literal notranslate"><span class="pre">use_training_data_features</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to <code class="docutils literal notranslate"><span class="pre">true</span></code>, or training features will not be able to be generated, giving null column errors.  For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to <code class="docutils literal notranslate"><span class="pre">true</span></code> or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data.  If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to <code class="docutils literal notranslate"><span class="pre">false</span></code>, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven’t changed, you could set it to <code class="docutils literal notranslate"><span class="pre">true</span></code> to save a small amount of processing time.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">output_suspicious_TD</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>.  Optional.  Used in the <code class="docutils literal notranslate"><span class="pre">model_exploration</span></code> link task.  Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data.  Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">split_by_id_a</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>.  Optional.  Used in the <code class="docutils literal notranslate"><span class="pre">model_exploration</span></code> link task.  When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a “A304BT” has three potential matches in the training data, one each to histid_b “B200”, “C201”, and “D425”, all of those potential matches would either end up in the “train” split or the “test” split when evaluating the model performance.</p></li>
-<li><p><code class="docutils literal notranslate"><span class="pre">feature_importances</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional, and currently not functional.  Whether to record feature importances for the training features when training or evaluating an ML model.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">feature_importances</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional.  Whether to record
+feature importances or coefficients for the training features when training
+the ML model. Set this to true to enable training step 3.</p></li>
 </ul>
 </li>
 </ul>

diff --git a/docs/link_tasks.html b/docs/link_tasks.html
@@ -71,16 +71,23 @@ <h3>Overview<a class="headerlink" href="#id1" title="Permalink to this heading">
 <p>Train a machine learning model to use for classification of potential links. This
 requires training data, which is read in in the first step. Comparison features
 are generated for the training data, and then the model is trained on the data
-and saved for use in the Matching task.</p>
+and saved for use in the Matching task. The last step optionally saves some metadata
+like feature importances or coefficients for the model to help with introspection.</p>
 </section>
 <section id="id2">
 <h3>Task steps<a class="headerlink" href="#id2" title="Permalink to this heading">¶</a></h3>
-<p>The steps in each of these tasks are the same:</p>
+<p>The first three steps in each of these tasks are the same:</p>
 <ul class="simple">
 <li><p>Step 0: Ingest the training data from a CSV file.</p></li>
 <li><p>Step 1: Create comparison features.</p></li>
 <li><p>Step 2: Train and save the model.</p></li>
 </ul>
+<p>The last step is available only for Training, not for Household Training.</p>
+<ul class="simple">
+<li><p>Step 3: Save the coefficients or feature importances of the model for inspection.
+This step is skipped by default. To enable it, set the <code class="docutils literal notranslate"><span class="pre">training.feature_importances</span></code>
+config attribute to true in your config file.</p></li>
+</ul>
 </section>
 <section id="id3">
 <h3>Related Configuration Sections<a class="headerlink" href="#id3" title="Permalink to this heading">¶</a></h3>

diff --git a/docs/searchindex.js b/docs/searchindex.js
diff --git a/sphinx-docs/config.md b/sphinx-docs/config.md
@@ -688,7 +688,9 @@ splits = [-1,0,6,11,9999]
   * `use_training_data_features` -- Type: `boolean`. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to `true`, or training features will not be able to be generated, giving null column errors.  For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to `true` or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data.  If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to `false`, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven't changed, you could set it to `true` to save a small amount of processing time.
   * `output_suspicious_TD` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data.  Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.
   * `split_by_id_a` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a "A304BT" has three potential matches in the training data, one each to histid_b "B200", "C201", and "D425", all of those potential matches would either end up in the "train" split or the "test" split when evaluating the model performance.
-  * `feature_importances` -- Type: `boolean`. Optional, and currently not functional.  Whether to record feature importances for the training features when training or evaluating an ML model.
+  * `feature_importances` -- Type: `boolean`. Optional.  Whether to record
+    feature importances or coefficients for the training features when training
+    the ML model. Set this to true to enable training step 3.
 
 
 ```

diff --git a/sphinx-docs/link_tasks.md b/sphinx-docs/link_tasks.md
@@ -30,15 +30,21 @@ as they are read in.
 Train a machine learning model to use for classification of potential links. This
 requires training data, which is read in in the first step. Comparison features
 are generated for the training data, and then the model is trained on the data
-and saved for use in the Matching task.
+and saved for use in the Matching task. The last step optionally saves some metadata
+like feature importances or coefficients for the model to help with introspection.
 
 ### Task steps
 
-The steps in each of these tasks are the same:
+The first three steps in each of these tasks are the same:
 * Step 0: Ingest the training data from a CSV file.
 * Step 1: Create comparison features.
 * Step 2: Train and save the model.
 
+The last step is available only for Training, not for Household Training.
+* Step 3: Save the coefficients or feature importances of the model for inspection.
+  This step is skipped by default. To enable it, set the `training.feature_importances`
+  config attribute to true in your config file.
+
 ### Related Configuration Sections
 
 * The [`training`](config.html#training-and-models) section is the most important