-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add a new interval list scatter mode #1786
Conversation
in large joint genotyping scatters
requested number of intervals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good @ldgauthier, just some minor requests.
|
||
/*** | ||
* A scatter by interval **count** which attempts to fill each resulting interval list with approximately equal numbers | ||
* of intervals, disregarding the base count. This approach distributes the remainder lists across the final interval lists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* of intervals, disregarding the base count. This approach distributes the remainder lists across the final interval lists. | |
* of intervals, disregarding the base count. This approach distributes the remainder intervals across the initial interval lists. |
*/ | ||
INTERVAL_COUNT_WITH_DISTRIBUTED_REMAINDER(IntervalListScattererByIntervalCountWithDistributedRemainder::new, "Scatter the interval list into similarly sized interval lists " + | ||
"(by interval count, not by base count). " + | ||
"Resulting interval lists will contain similar number of intervals but may exceed the requested number of scatters to account for the remainder."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should no longer exceed the requested number of scatters
@@ -467,7 +467,7 @@ protected int doWork() { | |||
LOG.info(String.format("Wrote %s scatter subdirectories to %s.", scattered.size, OUTPUT)); | |||
if (scattered.size != SCATTER_COUNT) { | |||
LOG.warn(String.format( | |||
"Requested scatter width of %s, but only emitted %s. (This may be an expected consequence of running in %s mode.)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still want to remove "only"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, now that the new mode doesn't output extra, that "only" is still appropriate. Good catch.
@@ -393,6 +403,8 @@ private Testcase(final File file, final int scatterWidth, final IntervalListScat | |||
IntervalList.overlaps(LIST_TO_SCATTER_MANY, secondThird)) | |||
)))))); | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra newlines
@@ -0,0 +1,200 @@ | |||
chr1:1-195878 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this file used of anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -0,0 +1,6 @@ | |||
@HD VN:1.6 SO:coordinate | |||
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd AS:38 UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sanitize headers if it's not too difficult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly the /seq/references path isn't super helpful for most picard contributors, but I like having the info that's there for reproducibility.
|
||
@Override | ||
public List<Interval> takeSome(final Interval interval, final long idealSplitWeight, final long currentSize, final double projectSizeOfRemaining) { | ||
final double amount = projectSizeOfRemaining - currentSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're just copying this structure from IntervalListScattererByIntervalCount
, but this feels like a very roundabout way to write if (projectSizeOfRemaining > currentSize)
LARGER_INTERVAL_FILE, 60, IntervalListScatterMode.INTERVAL_COUNT_WITH_DISTRIBUTED_REMAINDER, | ||
LARGER_EXPECTED_LISTS | ||
)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test where there is no remainder, and make sure all interval_lists end up the same size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a new test, I split the 200 intervals into 20 lists and verified that they each have 10 intervals.
w.r.t. scatter count versus scatter content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @ldgauthier! One small comment about a behavior change, but I think it is making the behavior more correct, so I'm ok with it. And I opened #1795 to document the test issues we've been discussing, and hopefully get them fixed someday.
@@ -455,7 +455,7 @@ protected int doWork() { | |||
|
|||
if (SCATTER_CONTENT != null) { | |||
final long listSize = SUBDIVISION_MODE.make().listWeight(output); | |||
SCATTER_COUNT = (int) listSize / SCATTER_CONTENT; | |||
SCATTER_COUNT = (int)Math.round((double) listSize / SCATTER_CONTENT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change behavior. I agree that it is more correct this way, but just want to flag it.
Add a new interval list scatter mode to avoid issue of giant final list in large joint genotyping scatters.
The WARP WGS joint genotyping workflow has a large (~20K) list of intervals of approximately even runtime, which we split by interval count for N < 20K scatters. However, because the INTERVAL_COUNT scatter mode does not allow the number of output lists to exceed the requested scatter count, the final list with the remainder can be very large with a prohibitively long runtime. For example, scattering 5980 ways for 2300 samples yields a final list that contains chromosomes 19, 20, 21, 22, X AND Y. Here we allow more interval lists to be returned so that they are all contain the same number of intervals for shards with reasonable runtimes.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests