From f9bccc7ac09be5c8728ad744f401d3437e5bd179 Mon Sep 17 00:00:00 2001
From: mhabedank <43589214+mhabedank@users.noreply.github.com>
Date: Tue, 25 Jun 2024 14:38:04 +0200
Subject: [PATCH] Refactored documentation about top-level preprocessing
 configuration (#365)

---
 docs/configuration/preprocessing.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/docs/configuration/preprocessing.md b/docs/configuration/preprocessing.md
index 795ee6a9..f868a362 100644
--- a/docs/configuration/preprocessing.md
+++ b/docs/configuration/preprocessing.md
@@ -149,11 +149,11 @@ preprocessing:
     undersample_majority: 0.7
 ```
 
-# Sample Ratio
+# Sample Ratio and Size
 
 Sometimes users may want to train on a sample of their input training data (maybe
 there's too much, and we only need 20%, or we want to try out ideas on a smaller
-subset of our data). In order to achieve this, a user can specify a `sample_ratio`
+subset of our data). In order to achieve this, a user can specify a `sample_ratio` or a `sample_size`
 to indicate the ratio of the dataset to use for training.
 
 By default, the sample ratio is 1.0, so if not specified, all the data will be
@@ -165,6 +165,18 @@ preprocessing:
     sample_ratio: 0.3
 ```
 
+Furthermore, if you want to specify the exact number of samples to use for training,
+you can use the `sample_size` parameter. For example, if you want to use 1000 samples for training,
+you could specify a config like this:
+
+```yaml
+preprocessing:
+    sample_size: 1000
+```
+
+!!! warning
+
+    `sample_size` can only be used when `sample_ratio` is 1.0, which is the default value.
+
 # Global Max Sequence Length
 
 There are [many factors at play](https://www.youtube.com/watch?v=g68qlo9Izf0&t=2685s)
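
The sampling rules this patch documents can be sketched roughly as follows. This is a minimal illustration of the described semantics only, not the library's actual implementation; the `sample_training_data` helper name and the plain-Python sampling approach are assumptions made for the example:

```python
import random

def sample_training_data(rows, sample_ratio=1.0, sample_size=None, seed=42):
    """Hypothetical helper illustrating the documented behavior:
    `sample_ratio` selects a fraction of the data, `sample_size` an exact
    row count, and `sample_size` is only valid while `sample_ratio`
    remains at its default of 1.0."""
    if sample_size is not None:
        if sample_ratio != 1.0:
            raise ValueError(
                "sample_size can only be used when sample_ratio is 1.0 (the default)"
            )
        k = sample_size
    else:
        k = int(len(rows) * sample_ratio)
    # Draw a reproducible random sample of k rows without replacement.
    return random.Random(seed).sample(rows, k)

rows = list(range(1000))
print(len(sample_training_data(rows, sample_size=100)))   # 100
print(len(sample_training_data(rows, sample_ratio=0.3)))  # 300
```

Passing both a non-default `sample_ratio` and a `sample_size` raises an error, mirroring the warning added in the patch.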