Refactored documentation about top-level preprocessing configuration (#…
mhabedank authored Jun 25, 2024
1 parent c423ee4 commit f9bccc7
Showing 1 changed file with 14 additions and 2 deletions.
docs/configuration/preprocessing.md (16 changes: 14 additions & 2 deletions)
@@ -149,11 +149,11 @@ preprocessing:
   undersample_majority: 0.7
 ```
 
-# Sample Ratio
+# Sample Ratio and Size
 
 Sometimes users may want to train on a sample of their input training data (maybe
 there's too much, and we only need 20%, or we want to try out ideas on a smaller
-subset of our data). In order to achieve this, a user can specify a `sample_ratio`
+subset of our data). In order to achieve this, a user can specify a `sample_ratio` or a `sample_size`
 to indicate the ratio of the dataset to use for training.
 
 By default, the sample ratio is 1.0, so if not specified, all the data will be
@@ -165,6 +165,18 @@ preprocessing:
   sample_ratio: 0.3
 ```
 
+Furthermore, if you want to specify the exact number of samples to use for training,
+you can use the `sample_size` parameter. For example, if you want to use 1000 samples for training,
+you could specify a config like this:
+
+```yaml
+preprocessing:
+  sample_size: 1000
+```
+!!! warning
+
+    `sample_size` can only be used when `sample_ratio` is 1.0, which is the default value.
+
 # Global Max Sequence Length
 
 There are [many factors at play](https://www.youtube.com/watch?v=g68qlo9Izf0&t=2685s)
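As a rough illustration of the semantics documented in the diff above, the two sampling options could be read as follows. This is only a sketch of the described behavior, not the library's actual implementation; the function name `sample_training_data` and the pandas-based sampling are hypothetical.

```python
# Illustrative sketch only: mirrors the documented semantics of
# sample_ratio / sample_size; not the library's actual code.
from typing import Optional

import pandas as pd


def sample_training_data(
    df: pd.DataFrame,
    sample_ratio: float = 1.0,
    sample_size: Optional[int] = None,
) -> pd.DataFrame:
    """Return the subset of df used for training."""
    if sample_size is not None:
        # Per the warning in the docs, an absolute sample_size is only
        # valid while sample_ratio stays at its default of 1.0.
        if sample_ratio != 1.0:
            raise ValueError("sample_size requires sample_ratio to be 1.0")
        return df.sample(n=sample_size)
    # Otherwise keep a fraction of the rows (1.0 keeps everything).
    return df.sample(frac=sample_ratio)
```

Under this reading, `sample_ratio: 0.3` keeps roughly 30% of the training rows, while `sample_size: 1000` keeps an exact count of 1000 rows.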
