Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix: --disable_group_texts 1 keep short samples #649

Merged
merged 1 commit into from
Sep 22, 2023

Conversation

research4pan
Copy link
Contributor

@research4pan research4pan commented Sep 22, 2023

--disable_group_texts 1 would discard short samples with length smaller than model_max_length or --block_size, and long samples will be cut into blocks, with last block being discarded. Now the behavior is changed to what is expected:

  • The short samples will be right padded to model_max_length
  • The long samples will be truncated to model_max_length

Along with the corresponding cache mechanism (new cache will be generated once --disable_group_texts argument is different from previous caches)

`--disable_group_texts 1` would discard short samples with length smaller than
`model_max_length` or `--block_size`. Now the behavior is changed to what is
expected:
    - The short samples will be right padded to `model_max_length`
    - The long samples will be truncated to `model_max_length`

Along with the corresponding cache mechanism (new cache will be generated once
`--disable_group_texts` argument is different from previous caches)
Copy link
Member

@SHUMKASHUN SHUMKASHUN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@SHUMKASHUN SHUMKASHUN merged commit 6fae940 into main Sep 22, 2023
@research4pan research4pan deleted the rpan-disable-group-text branch March 31, 2024 09:41
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants