-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add TrainGCNV input specifying subset list of samples for training #294
Conversation
wdl/TrainGCNV.wdl
Outdated
@@ -111,7 +125,8 @@ workflow TrainGCNV { | |||
} | |||
} | |||
|
|||
Array[Int] sample_indices = select_first([RandomSubsampleStringArray.subsample_indices_array, range(length(samples))]) | |||
Array[Int] sample_indices = select_first([GetSubsampledIndices.subsample_indices_array, RandomSubsampleStringArray.subsample_indices_array, range(length(samples))]) | |||
Array[String] sample_ids = select_first([GetSubsampledIndices.subsampled_strings_array, RandomSubsampleStringArray.subsampled_strings_array, samples]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be simpler/clearer to move this into the scatter
below:
scatter (i in sample_indices) {
String sample_ids = samples[i]
call cov.CondenseReadCounts as CondenseReadCounts {
input:
counts = count_files[i],
sample = samples[i],
num_bins = condense_num_bins,
expected_bin_size = condense_bin_size,
condense_counts_docker = condense_counts_docker,
runtime_attr_override=condense_counts_runtime_attr
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also minor style comment, I might rename this to something making it clear it isn't an input. I usually add an extra underscore to the end, or you could call it maybe_subsetted_sample_ids
for example.
wdl/Utils.wdl
Outdated
|
||
RuntimeAttr default_attr = object { | ||
cpu_cores: 1, | ||
mem_gb: 3.75, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mem_gb: 3.75, | |
mem_gb: 1, |
wdl/Utils.wdl
Outdated
all_strings = ['~{sep="','" all_strings}'] | ||
subset_strings = {'~{sep="','" subset_strings}'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more robust/scalable to use write_lines()
to write the strings to a file and read them in, rather than defining them inline (who knows what would happen if we fed this 10k samples for example).
Updates since review:
Testing since review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
Updates
Add
sample_ids_training_subset
input toTrainGCNV
to allow Terra users to specify a subsetted list of sample IDs on which to train the gCNV model, while keeping the same batching they will use for GatherBatchEvidence. Intended as an alternative to the random downsampling option if the user wants a greater level of control over the samples chosen.Testing
TrainGCNV
ontest_small
test set and specified four sample IDs for training, out of order. Verified that the four were sorted according to the order of thesamples
input (and other sample-level inputs) and the rest of the tasks were performed on the subset of samples.