filter: Make under-sampling more apparent #1590

Open
victorlin opened this issue Aug 20, 2024 · 0 comments


victorlin commented Aug 20, 2024

Context

Under-sampling occurs in augur filter when a group has fewer available sequences than the targeted group size. This is currently not reported in any output. It is explained in this recently added docs section:

consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. --group-by year --subsample-max-sequences 300 is equivalent to --group-by year --sequences-per-group 150. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
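
The arithmetic in that example can be reproduced with a minimal sketch. This is not augur's implementation: `sample_sizes` is a hypothetical helper, and it assumes the maximum is split evenly across groups with integer division, which matches the docs example but may not hold in every case.

```python
def sample_sizes(available, max_sequences):
    """Return the number of sequences taken from each group.

    Assumption: the overall maximum is divided evenly across groups,
    and a group can never contribute more than it has available.
    """
    per_group = max_sequences // len(available)
    return {group: min(count, per_group) for group, count in available.items()}


available = {"2023": 200, "2024": 100}
sizes = sample_sizes(available, 300)
total = sum(sizes.values())
print(sizes, total)  # {'2023': 150, '2024': 100} 250
```

The total of 250 falls short of the target of 300 because the 2024 group cannot fill its share and nothing is topped up from 2023.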

Some historical context from #1454 (comment):

In the original formulation of only --sequences-per-group the idea was to say specify --sequences-per-group 10 and --group-by country would target 10 sequences per country and randomly sample these sequences for each country group. In the original formulation, we wouldn't top-up other countries. I think this is a semantic complication with adding the convenience parameter of --subsample-max-sequences. I'd think of --subsample-max-sequences as solely specifying --sequences-per-group.

Possible solutions

  1. Add warnings. Example:

    WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.
    
  2. Add an option --output-group-by-sizes to highlight any discrepancies. Example:

    year  target size  available sequences  output size
    2023  150          200                  150
    2024  150          100                  100

Both solutions have been adopted for --group-by-weights in #1454, but they could be extended to other sampling methods.
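
As a rough sketch of how the two proposals could fit together for a uniform per-group target, the snippet below builds both the warning messages and the size table from the same comparison. The function `report_group_sizes` and its parameters are hypothetical and do not reflect the code added in #1454; the tab-separated layout is also an assumption.

```python
import sys


def report_group_sizes(available, target_size, column="year"):
    """Compare each group's available count against the target.

    Returns (rows, warnings): rows for a size table and one warning
    per under-sampled group. Hypothetical helper, not augur's API.
    """
    rows = [(column, "target size", "available sequences", "output size")]
    warnings = []
    for group, count in available.items():
        output_size = min(count, target_size)
        if count < target_size:
            warnings.append(
                f"WARNING: Targeted {target_size} sequences for group "
                f"[{column}='{group}'] but only {count} are available."
            )
        rows.append((group, str(target_size), str(count), str(output_size)))
    return rows, warnings


rows, warnings = report_group_sizes({"2023": 200, "2024": 100}, 150)
for message in warnings:
    print(message, file=sys.stderr)  # warnings go to stderr
for row in rows:
    print("\t".join(row))  # table as tab-separated values
```

Deriving both outputs from one pass keeps the warning and the table consistent with each other, whichever of the two ends up being exposed.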

@victorlin victorlin added the enhancement New feature or request label Aug 20, 2024
@victorlin victorlin self-assigned this Aug 20, 2024