Refactored documentation about top-level preprocessing configuration (#…
mhabedank authored Jun 25, 2024
1 parent c423ee4 commit f9bccc7
Showing 1 changed file with 14 additions and 2 deletions.
docs/configuration/preprocessing.md (16 changes: 14 additions & 2 deletions)
@@ -149,11 +149,11 @@ preprocessing:
   undersample_majority: 0.7
 ```
 
-# Sample Ratio
+# Sample Ratio and Size
 
 Sometimes users may want to train on a sample of their input training data (maybe
 there's too much, and we only need 20%, or we want to try out ideas on a smaller
-subset of our data). In order to achieve this, a user can specify a `sample_ratio`
+subset of our data). In order to achieve this, a user can specify a `sample_ratio` or a `sample_size`
 to indicate the ratio of the dataset to use for training.
 
 By default, the sample ratio is 1.0, so if not specified, all the data will be
@@ -165,6 +165,18 @@ preprocessing:
   sample_ratio: 0.3
 ```
 
+Furthermore, if you want to specify the exact number of samples to use for training,
+you can use the `sample_size` parameter. For example, if you want to use 1000 samples for training,
+you could specify a config like this:
+
+```yaml
+preprocessing:
+  sample_size: 1000
+```
+!!! warning
+
+    `sample_size` can only be used when `sample_ratio` is 1.0, which is the default value.
+
 # Global Max Sequence Length
 
 There are [many factors at play](https://www.youtube.com/watch?v=g68qlo9Izf0&t=2685s)
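As a rough illustration of the semantics documented in the diff above, the two sampling options could be read as follows. This is only a sketch of the described behavior, not the library's actual implementation; the function name `sample_training_data` and the pandas-based sampling are hypothetical.

```python
# Illustrative sketch only: mirrors the documented semantics of
# sample_ratio / sample_size; not the library's actual code.
from typing import Optional

import pandas as pd


def sample_training_data(
    df: pd.DataFrame,
    sample_ratio: float = 1.0,
    sample_size: Optional[int] = None,
) -> pd.DataFrame:
    """Return the subset of df used for training."""
    if sample_size is not None:
        # Per the warning in the docs, an absolute sample_size is only
        # valid while sample_ratio stays at its default of 1.0.
        if sample_ratio != 1.0:
            raise ValueError("sample_size requires sample_ratio to be 1.0")
        return df.sample(n=sample_size)
    # Otherwise keep a fraction of the rows (1.0 keeps everything).
    return df.sample(frac=sample_ratio)
```

Under this reading, `sample_ratio: 0.3` keeps roughly 30% of the training rows, while `sample_size: 1000` keeps an exact count of 1000 rows.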
