From f9bccc7ac09be5c8728ad744f401d3437e5bd179 Mon Sep 17 00:00:00 2001
From: mhabedank <43589214+mhabedank@users.noreply.github.com>
Date: Tue, 25 Jun 2024 14:38:04 +0200
Subject: [PATCH] Refactored documentation about top-level preprocessing
 configuration (#365)

---
 docs/configuration/preprocessing.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/docs/configuration/preprocessing.md b/docs/configuration/preprocessing.md
index 795ee6a9..f868a362 100644
--- a/docs/configuration/preprocessing.md
+++ b/docs/configuration/preprocessing.md
@@ -149,11 +149,11 @@ preprocessing:
     undersample_majority: 0.7
 ```
 
-# Sample Ratio
+# Sample Ratio and Size
 
 Sometimes users may want to train on a sample of their input training data (maybe
 there's too much, and we only need 20%, or we want to try out ideas on a smaller
-subset of our data). In order to achieve this, a user can specify a `sample_ratio`
+subset of our data). In order to achieve this, a user can specify a `sample_ratio` or a `sample_size`
 to indicate the ratio of the dataset to use for training.
 
 By default, the sample ratio is 1.0, so if not specified, all the data will be
@@ -165,6 +165,18 @@ preprocessing:
     sample_ratio: 0.3
 ```
 
+Furthermore, if you want to specify the exact number of samples to use for training,
+you can use the `sample_size` parameter. For example, if you want to use 1000 samples for training,
+you could specify a config like this:
+
+```yaml
+preprocessing:
+    sample_size: 1000
+```
+
+!!! warning
+
+    `sample_size` can only be used when `sample_ratio` is 1.0, which is the default value.
+
 # Global Max Sequence Length
 
 There are [many factors at play](https://www.youtube.com/watch?v=g68qlo9Izf0&t=2685s)
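
The sampling rules this patch documents can be sketched roughly as follows. This is a minimal illustration of the described semantics only, not the library's actual implementation; the `sample_training_data` helper name and the plain-Python sampling approach are assumptions made for the example:

```python
import random

def sample_training_data(rows, sample_ratio=1.0, sample_size=None, seed=42):
    """Hypothetical helper illustrating the documented behavior:
    `sample_ratio` selects a fraction of the data, `sample_size` an exact
    row count, and `sample_size` is only valid while `sample_ratio`
    remains at its default of 1.0."""
    if sample_size is not None:
        if sample_ratio != 1.0:
            raise ValueError(
                "sample_size can only be used when sample_ratio is 1.0 (the default)"
            )
        k = sample_size
    else:
        k = int(len(rows) * sample_ratio)
    # Draw a reproducible random sample of k rows without replacement.
    return random.Random(seed).sample(rows, k)

rows = list(range(1000))
print(len(sample_training_data(rows, sample_size=100)))   # 100
print(len(sample_training_data(rows, sample_ratio=0.3)))  # 300
```

Passing both a non-default `sample_ratio` and a `sample_size` raises an error, mirroring the warning added in the patch.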