Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add a new interval list scatter mode #1786

Merged
merged 8 commits into from
Apr 7, 2022
Merged

Add a new interval list scatter mode #1786

merged 8 commits into from
Apr 7, 2022

Conversation

ldgauthier
Copy link
Contributor

Add a new interval list scatter mode to avoid issue of giant final list in large joint genotyping scatters.
The WARP WGS joint genotyping workflow has a large (~20K) list of intervals of approximately even runtime, which we split by interval count for N < 20K scatters. However, because the INTERVAL_COUNT scatter mode does not allow the number of output lists to exceed the requested scatter count, the final list with the remainder can be very large with a prohibitively long runtime. For example, scattering 5980 ways for 2300 samples yields a final list that contains chromosomes 19, 20, 21, 22, X AND Y. Here we allow more interval lists to be returned so that they are all contain the same number of intervals for shards with reasonable runtimes.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@ldgauthier ldgauthier requested a review from kachulis March 15, 2022 18:39
Copy link
Contributor

@kachulis kachulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @ldgauthier, just some minor requests.


/***
* A scatter by interval **count** which attempts to fill each resulting interval list with approximately equal numbers
* of intervals, disregarding the base count. This approach distributes the remainder lists across the final interval lists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* of intervals, disregarding the base count. This approach distributes the remainder lists across the final interval lists.
* of intervals, disregarding the base count. This approach distributes the remainder intervals across the initial interval lists.

*/
INTERVAL_COUNT_WITH_DISTRIBUTED_REMAINDER(IntervalListScattererByIntervalCountWithDistributedRemainder::new, "Scatter the interval list into similarly sized interval lists " +
"(by interval count, not by base count). " +
"Resulting interval lists will contain similar number of intervals but may exceed the requested number of scatters to account for the remainder.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should no longer exceed the requested number of scatters

@@ -467,7 +467,7 @@ protected int doWork() {
LOG.info(String.format("Wrote %s scatter subdirectories to %s.", scattered.size, OUTPUT));
if (scattered.size != SCATTER_COUNT) {
LOG.warn(String.format(
"Requested scatter width of %s, but only emitted %s. (This may be an expected consequence of running in %s mode.)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still want to remove "only"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, now that the new mode doesn't output extra, that "only" is still appropriate. Good catch.

@@ -393,6 +403,8 @@ private Testcase(final File file, final int scatterWidth, final IntervalListScat
IntervalList.overlaps(LIST_TO_SCATTER_MANY, secondThird))
))))));



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra newlines

@@ -0,0 +1,200 @@
chr1:1-195878
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this file used of anything?

Copy link
Contributor Author

@ldgauthier ldgauthier Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -0,0 +1,6 @@
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd AS:38 UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanitize headers if it's not too difficult

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly the /seq/references path isn't super helpful for most picard contributors, but I like having the info that's there for reproducibility.


@Override
public List<Interval> takeSome(final Interval interval, final long idealSplitWeight, final long currentSize, final double projectSizeOfRemaining) {
final double amount = projectSizeOfRemaining - currentSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're just copying this structure from IntervalListScattererByIntervalCount, but this feels like a very roundabout way to write if (projectSizeOfRemaining > currentSize)

LARGER_INTERVAL_FILE, 60, IntervalListScatterMode.INTERVAL_COUNT_WITH_DISTRIBUTED_REMAINDER,
LARGER_EXPECTED_LISTS
));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test where there is no remainder, and make sure all interval_lists end up the same size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a new test, I split the 200 intervals into 20 lists and verified that they each have 10 intervals.

Copy link
Contributor

@kachulis kachulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @ldgauthier! One small comment about a behavior change, but I think it is making the behavior more correct, so I'm ok with it. And I opened #1795 to document the test issues we've been discussing, and hopefully get them fixed someday.

@@ -455,7 +455,7 @@ protected int doWork() {

if (SCATTER_CONTENT != null) {
final long listSize = SUBDIVISION_MODE.make().listWeight(output);
SCATTER_COUNT = (int) listSize / SCATTER_CONTENT;
SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change behavior. I agree that it is more correct this way, but just want to flag it.

@ldgauthier ldgauthier merged commit 9eafe4e into master Apr 7, 2022
@ldgauthier ldgauthier deleted the ldg_newScatterMode branch April 7, 2022 16:52
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants