[ADAM-2023] Implemented Duplicate Marking algorithm in Spark SQL #2045
Conversation
With these changes, I see as much as a 30% speedup for large datasets. Fixes #2023
Can one of the admins verify this patch?
Jenkins, test this please
Jenkins, add to whitelist
 * @param alignmentRecords GenomicRDD of alignment records
 * @return RDD of alignment records with the "duplicateRead" field marked appropriately
 */
def apply(alignmentRecords: AlignmentRecordRDD): RDD[AlignmentRecord] = {
We should refactor the caller here so that we don't force the conversions between RDD and Dataset. In other words, only perform the conversions if alignmentRecords has been realized as RDDBoundAlignmentRecordRDD.
I see, so there should be another apply method which takes Dataset[AlignmentRecord], and this apply method's signature should be changed to take RDDBoundAlignmentRecordRDD?
Or do you mean that this method should just be changed to take Dataset[AlignmentRecord]? In that case, would it also make sense to refactor so that the other apply method also just takes Dataset[Fragment], and to do a similar refactoring of the caller?
Does this look right? We would want the caller to be
abstract class AlignmentRecordRDD ... {
  def markDuplicates(): AlignmentRecordRDD = {
    replaceRdd(MarkDuplicates(this.rdd, this.recordGroups))
  }
}

case class DatasetBoundAlignmentRecordRDD ... {
  override def markDuplicates(): AlignmentRecordRDD = {
    replaceDataset(MarkDuplicates(this.dataset, this.recordGroups))
  }
}

abstract class FragmentRDD ... {
  def markDuplicates(): FragmentRDD = {
    replaceRdd(MarkDuplicates(this.rdd, this.recordGroups))
  }
}

case class DatasetBoundFragmentRDD ... {
  override def markDuplicates(): FragmentRDD = {
    replaceDataset(MarkDuplicates(this.dataset, this.recordGroups))
  }
}
so then the apply methods might be
import org.bdgenomics.formats.avro.{
  AlignmentRecord,
  Fragment
}
import org.bdgenomics.adam.sql.{
  AlignmentRecord => AlignmentRecordProduct,
  Fragment => FragmentProduct
}

object MarkDuplicates {
  def apply(rdd: RDD[AlignmentRecord], rgd: RecordGroupDictionary): RDD[AlignmentRecord] = { ??? }
  def apply(rdd: RDD[Fragment], rgd: RecordGroupDictionary): RDD[Fragment] = { ??? }
  def apply(ds: Dataset[AlignmentRecordProduct], rgd: RecordGroupDictionary): Dataset[AlignmentRecordProduct] = { ??? }
  def apply(ds: Dataset[FragmentProduct], rgd: RecordGroupDictionary): Dataset[FragmentProduct] = { ??? }
}
Alright, sounds good. Because of type erasure there will need to be a single apply[T] for RDD and for Dataset for fragments/alignment-records.
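The erasure clash can be reproduced in miniature with plain collections. This is an illustrative sketch only, not ADAM code: ErasureDemo is a made-up name, and the DummyImplicit parameter shown is one standard workaround for disambiguating overloads that erase to the same JVM signature (the alternative being a single generic apply[T] that dispatches internally, as suggested above).

```scala
object ErasureDemo {
  // These two overloads both erase to apply(List): Int, so defining them
  // side by side would not compile:
  //   def apply(xs: List[Int]): Int = ...
  //   def apply(xs: List[String]): Int = ...
  // Adding an extra DummyImplicit parameter (provided automatically by
  // scala.Predef) changes the erased signature; call sites are unchanged.
  def apply(xs: List[Int]): Int = xs.sum
  def apply(xs: List[String])(implicit d: DummyImplicit): Int =
    xs.map(_.length).sum
}
```

Callers simply write ErasureDemo(List(1, 2, 3)) or ErasureDemo(List("ab", "c")) and overload resolution picks the right method before erasure applies.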
Ah, good point. Is RecordGroupDictionary actually required for the fragment cases? If not, that could be the discriminator.
I believe that it is necessary for both, so it can't be used as a distinguisher.
Would it be better to take RDDBoundAlignmentRecordRDD and DatasetBoundAlignmentRecordRDD so that the conversion from RDD to Dataset that is already implemented in RDDBoundAlignmentRecordRDD is not duplicated within MarkDuplicates?
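The shape of that dispatch can be modeled in a few lines of plain Scala. This is a simplified stand-in, not ADAM code: RddBound/DatasetBound play the role of RDDBoundAlignmentRecordRDD/DatasetBoundAlignmentRecordRDD, and Seq[String] stands in for the real record types, so the only point made is that the RDD-to-Dataset conversion lives in one place and runs at most once.

```scala
// Simplified model of the bound-representation hierarchy.
sealed trait BoundRecords
case class DatasetBound(records: Seq[String]) extends BoundRecords
case class RddBound(records: Seq[String]) extends BoundRecords {
  // The single place the RDD -> Dataset conversion is implemented.
  def toDatasetBound: DatasetBound = DatasetBound(records)
}

object MarkDuplicatesSketch {
  // Dispatch on the bound type: Dataset-bound input is used directly;
  // RDD-bound input is converted exactly once via its own conversion.
  def apply(input: BoundRecords): DatasetBound = input match {
    case ds: DatasetBound => ds
    case rdd: RddBound    => rdd.toDatasetBound
  }
}
```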
 * Case class which merely extends the Fragment schema by a single column "duplicateFragment" so that
 * a DataFrame with fragments having been marked as duplicates can be cast back into a Dataset
 */
private case class FragmentDuplicateSchema(readName: Option[String] = None,
Do you think it would be useful to add duplicateFragment or a similarly named field to the Avro schema definition for Fragment, or is this flag only useful in a temporary context?
I'm not sure. I was considering it while developing, but then I realized that it really only seems useful in a temporary context: each of the alignment records contained within the fragment already carries a duplicate flag, so the schema would have redundancies.
Test PASSed.
… for Dataset duplicate marking path
Test PASSed.
Because the Spark SQL implementation of duplicate marking was not scaling well to cluster runs, this version converts many of the groupBy-followed-by-join operations back to the original Dataset operations with window functions. This should reduce the amount of data that has to be shuffled when running on a cluster and make the performance benefits scale.
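As a local-collection analogy of that rewrite (not Spark code; ScoredRead and both helper methods are illustrative assumptions), the groupBy-plus-join shape computes a per-group summary and then joins every row against it, while the window-style shape ranks rows within each partition in a single pass and flags everything after the top-ranked row:

```scala
case class ScoredRead(position: Int, score: Int)

object DupFlagSketch {
  // groupBy-then-join style: build a best-score-per-position summary,
  // then look each read up against that summary (the "join back").
  def viaGroupByJoin(reads: Seq[ScoredRead]): Seq[Boolean] = {
    val best = reads.groupBy(_.position).map {
      case (pos, rs) => pos -> rs.map(_.score).max
    }
    reads.map(r => r.score < best(r.position))
  }

  // window style: sort within each position partition and flag every
  // read ranked after the first, in one pass over each partition.
  def viaWindow(reads: Seq[ScoredRead]): Seq[Boolean] = {
    val flagByIndex = reads.zipWithIndex
      .groupBy { case (r, _) => r.position }
      .values
      .flatMap { partition =>
        val sorted = partition.sortBy { case (r, _) => -r.score }
        sorted.zipWithIndex.map { case ((_, idx), rank) => idx -> (rank > 0) }
      }
      .toMap
    reads.indices.map(flagByIndex)
  }
}
```

In Spark the difference matters because the join-back variant shuffles the data an extra time, whereas a window over a partition key reuses a single shuffle; the local versions here only show that the two shapes compute the same flags (ties aside).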
Test PASSed.
Closing as WontFix. Performance testing of these changes was inconclusive. Feel free to create a new PR after rebasing against git head.