filter: Make under-sampling more apparent #1590

victorlin · 2024-08-20T23:56:16Z

Context

Under-sampling occurs in augur filter when a group has fewer available sequences than its targeted size. This is not reported in any output. It is explained only in this recently added docs section:

consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. --group-by year --subsample-max-sequences 300 is equivalent to --group-by year --sequences-per-group 150. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
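The arithmetic in that example can be sketched as follows. This is a simplified illustration, not augur's implementation: `sample_sizes` is a hypothetical helper, and it assumes the per-group target is the maximum split evenly across groups, as in the quoted example.

```python
def sample_sizes(available, max_sequences):
    """Hypothetical helper: map each group to (target size, output size).

    Assumes the per-group target is max_sequences split evenly across
    groups, as in the docs example; augur's real calculation may differ.
    """
    target = max_sequences // len(available)
    # A group can never contribute more sequences than it has available.
    return {group: (target, min(target, n)) for group, n in available.items()}


available = {"2023": 200, "2024": 100}
sizes = sample_sizes(available, 300)
total = sum(output for _, output in sizes.values())

print(sizes)  # {'2023': (150, 150), '2024': (150, 100)}
print(total)  # 250, which is less than the target of 300
```

The shortfall of 50 comes entirely from the 2024 group, and nothing in the current output points to it.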

Some historical context from #1454 (comment):

In the original formulation, with only --sequences-per-group, the idea was that specifying --sequences-per-group 10 and --group-by country would target 10 sequences per country and randomly sample them within each country group. In that formulation, we wouldn't top up other countries. I think this is a semantic complication of adding the convenience parameter --subsample-max-sequences. I'd think of --subsample-max-sequences as solely specifying --sequences-per-group.

Possible solutions

Add warnings. Example:

WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.

Add an option --output-group-by-sizes to highlight any discrepancies. Example:

| year | target size | available sequences | output size |
|------|-------------|---------------------|-------------|
| 2023 | 150         | 200                 | 150         |
| 2024 | 150         | 100                 | 100         |
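As a rough sketch, both proposals could be derived from the same per-group counts. The function name `report_group_sizes` and its parameters are hypothetical and not part of augur. The warning text and column names follow the examples above.

```python
def report_group_sizes(target_size, available, column="year"):
    """Hypothetical helper: build warnings and size-table rows per group."""
    warnings, rows = [], []
    for group, n_available in sorted(available.items()):
        output_size = min(target_size, n_available)
        rows.append((group, target_size, n_available, output_size))
        if n_available < target_size:
            # Under-sampled group: fewer sequences exist than were targeted.
            warnings.append(
                f"WARNING: Targeted {target_size} sequences for group "
                f"[{column}={group!r}] but only {n_available} are available."
            )
    return warnings, rows


warnings, rows = report_group_sizes(150, {"2023": 200, "2024": 100})

for message in warnings:
    print(message)

# Tab-separated table, using the column names from the example above.
print("\t".join(["year", "target size", "available sequences", "output size"]))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Computing both outputs in one pass keeps the warning and the table consistent with each other.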

Both solutions have been adopted for --group-by-weights in #1454, but they could be extended to other sampling methods.

victorlin added the enhancement New feature or request label Aug 20, 2024

victorlin self-assigned this Aug 20, 2024

victorlin mentioned this issue Aug 21, 2024 in Implement weighted sampling #1454 (Merged, 12 tasks)
