
[SNAP-1034] Optimizations at Spark layer as seen in profiling #10

Merged
merged 4 commits into from
Sep 7, 2016

Conversation

sumwale
Copy link

@sumwale sumwale commented Sep 7, 2016

What changes were proposed in this pull request?

  • added an aggBufferWithKeyAttributes to aggregates, used to avoid nullability checks in the generated code for aggregate buffers in HashAggregateExec (if the aggregate is on zero rows, there will be no row in the map)
  • use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map; added clear methods to OpenHashMap/OpenHashSet so they can be reused
  • temporary change to use a local cache in ClosureCleaner to avoid cleaning closures that can be serialized as-is
  • minor correction to a string in HiveUtils

Note that the closure cleaner change is just a temporary hack for testing. It will be given a proper shape by caching the cleaning steps for each class, if any, and then applying those steps in order (this strategy may not work well for polymorphic types, but users can handle such special cases explicitly). This is being tracked in a separate JIRA.
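The temporary ClosureCleaner change described above can be pictured roughly as follows. This is a minimal sketch of the caching idea only; the names (`CachedClosureCleaner`, `cleanedCache`, `serializableAsIs`) are illustrative and are not the actual Spark internals:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hedged sketch: remember, per closure class, whether the closure could be
// serialized as-is, so the expensive cleaning pass is skipped for later
// closures of the same class. Names here are illustrative only.
object CachedClosureCleaner {
  // closure class -> true if it serialized cleanly without cleaning
  private val cleanedCache = new ConcurrentHashMap[Class[_], java.lang.Boolean]()

  def clean(closure: AnyRef,
            doClean: AnyRef => Unit,
            serializableAsIs: AnyRef => Boolean): Unit = {
    val cls = closure.getClass
    val cached = cleanedCache.get(cls)
    if (cached == null) {
      val ok = serializableAsIs(closure)
      cleanedCache.put(cls, ok)
      if (!ok) doClean(closure)
    } else if (!cached.booleanValue()) {
      doClean(closure)
    }
    // if cached is true, cleaning is skipped entirely
  }
}
```

As the PR notes, a per-class cache like this can misbehave for polymorphic closure types, which is why the hack was reverted pending a proper design.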

How was this patch tested?

Applied and tested with the upstream Spark branch-2.0.

Sumedh Wale added 3 commits September 7, 2016 12:14
 - for all cases of implicit casts, convert to date or timestamp values
   instead of string when one side is a string
 - likewise, when one side is a timestamp and the other a date, both were being
   converted to string; now convert the date to timestamp
 - added an aggBufferWithKeyAttributes to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map)
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - temporary change to use a local cache in ClosureCleaner to avoid cleaning
   closures that can be serialized as-is
 - minor correction to a string in HiveUtils
Reverting non-null aggregate attributes for Min, Max, First, Last since these
depend on the aggregate value being null initially so it can be set during iteration.

Added an "initialValuesForGroup" for aggregate attributes that sets up the
value as the zero of the data type for Sum and Average when creating the initial
aggregation buffer (instead of null)

Renamed "aggregateBufferWithKeyAttribute" to "aggregateBufferAttributeForGroup"

Reverted the optimization hack in ClosureCleaner for now.
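The distinction drawn in the commit above, that Sum and Average can start a group's buffer at a type-appropriate zero while Min, Max, First, Last must start at null, can be sketched as below. This is a simplified stand-in for illustration, not the actual Catalyst expression classes:

```scala
// Hedged sketch of the "initialValuesForGroup" idea: once a group's row
// exists in the hash map at all, Sum/Average can start from zero and stay
// non-null, letting the generated code drop null checks; Min/Max/First/Last
// have no neutral element and must still start as null so the first input
// value can be detected during iteration.
sealed trait AggInit
case object NullInit extends AggInit                  // Min, Max, First, Last
final case class ZeroInit(zero: Any) extends AggInit  // Sum, Average

def initialValueForGroup(fn: String): AggInit = fn match {
  case "sum" | "avg" => ZeroInit(0L) // zero of the buffer's data type
  case _             => NullInit     // value must begin as null
}
```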
@sumwale sumwale merged commit 0313b61 into snappy/branch-2.0 Sep 7, 2016
@sumwale sumwale deleted the SNAP-1034 branch September 7, 2016 15:42
ymahajan pushed a commit that referenced this pull request Jan 13, 2017
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils
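The clear methods added to OpenHashMap/OpenHashSet for reuse can be sketched as below. This mirrors the intent only (reset occupancy without reallocating the backing arrays, so the structure can be reused across dictionary-encoding batches); `ReusableOpenHashSet` is an illustrative toy, not the Spark collection internals:

```scala
import java.util.Arrays

// Hedged sketch: a minimal open-addressing set of longs with a clear()
// that resets state but keeps the allocated arrays, avoiding repeated
// allocation when the structure is reused. Capacity is fixed for brevity.
final class ReusableOpenHashSet(capacity: Int) {
  private val data = new Array[Long](capacity)
  private val occupied = new Array[Boolean](capacity)
  private var count = 0

  def add(v: Long): Unit = {
    var i = ((v % capacity).toInt + capacity) % capacity
    while (occupied(i) && data(i) != v) i = (i + 1) % capacity // linear probing
    if (!occupied(i)) { occupied(i) = true; data(i) = v; count += 1 }
  }

  def contains(v: Long): Boolean = {
    var i = ((v % capacity).toInt + capacity) % capacity
    while (occupied(i)) {
      if (data(i) == v) return true
      i = (i + 1) % capacity
    }
    false
  }

  def size: Int = count

  // The key addition: reuse without reallocating the backing arrays.
  def clear(): Unit = {
    Arrays.fill(occupied, false)
    count = 0
  }
}
```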
sumwale pushed a commit that referenced this pull request Jul 8, 2017
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils
ymahajan pushed a commit that referenced this pull request Feb 22, 2018
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
ashetkar pushed a commit that referenced this pull request Apr 5, 2018
* Use alpine and java 8 for docker images.

* Remove installation of vim and redundant comment
sumwale pushed a commit to sumwale/spark that referenced this pull request Nov 5, 2020
…oftware#10)

 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - add addLong/longValue methods to SQLMetric that use primitive longs instead of Long objects
sumwale pushed a commit that referenced this pull request Jul 11, 2021
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - add addLong/longValue methods to SQLMetric that use primitive longs instead of Long objects
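The addLong/longValue point above is about avoiding java.lang.Long boxing on the hot metric-update path. A minimal sketch of the idea, using an illustrative `PrimitiveMetric` class rather than the actual SQLMetric:

```scala
// Hedged sketch: metric updates that take java.lang.Long box (allocate and
// unbox) on every call; methods operating directly on a primitive long
// field avoid that allocation on the hot path.
final class PrimitiveMetric {
  private var value: Long = 0L

  // Boxed path: each call goes through a java.lang.Long object.
  def add(v: java.lang.Long): Unit = value += v.longValue()

  // Primitive path: no object allocation per update.
  def addLong(v: Long): Unit = value += v

  def longValue: Long = value
}
```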