
[SNAP-1034] Optimizations at Spark layer as seen in profiling #10

Merged
merged 4 commits into from
Sep 7, 2016

Conversation

sumwale
Copy link

@sumwale sumwale commented Sep 7, 2016

What changes were proposed in this pull request?

  • added an aggBufferWithKeyAttributes to aggregates, used to avoid nullability checks in the generated code for aggregate buffers in HashAggregateExec (if the aggregate is on zero rows, there will be no row in the map)
  • use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map; added clear methods to OpenHashMap/OpenHashSet so they can be reused
  • temporary change to use a local cache in ClosureCleaner to avoid cleaning closures that can be serialized as-is
  • minor correction to a string in HiveUtils

Note that the closure cleaner change is just a temporary hack for testing. It will be given a proper shape by caching the cleaning steps for each class, if any, and then applying those steps in order (this strategy may not work well for polymorphic types, but users can handle such special cases explicitly). This is being tracked in a separate JIRA.
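The temporary ClosureCleaner change described above can be pictured roughly as follows. This is a minimal sketch of the caching idea only; the names (`CachedClosureCleaner`, `cleanedCache`, `serializableAsIs`) are illustrative and are not the actual Spark internals:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hedged sketch: remember, per closure class, whether the closure could be
// serialized as-is, so the expensive cleaning pass is skipped for later
// closures of the same class. Names here are illustrative only.
object CachedClosureCleaner {
  // closure class -> true if it serialized cleanly without cleaning
  private val cleanedCache = new ConcurrentHashMap[Class[_], java.lang.Boolean]()

  def clean(closure: AnyRef,
            doClean: AnyRef => Unit,
            serializableAsIs: AnyRef => Boolean): Unit = {
    val cls = closure.getClass
    val cached = cleanedCache.get(cls)
    if (cached == null) {
      val ok = serializableAsIs(closure)
      cleanedCache.put(cls, ok)
      if (!ok) doClean(closure)
    } else if (!cached.booleanValue()) {
      doClean(closure)
    }
    // if cached is true, cleaning is skipped entirely
  }
}
```

As the PR notes, a per-class cache like this can misbehave for polymorphic closure types, which is why the hack was reverted pending a proper design.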

How was this patch tested?

Applied and tested with the upstream Spark branch-2.0.

Sumedh Wale added 3 commits September 7, 2016 12:14
 - for all cases of implicit casts, convert to date or timestamp values
   instead of string when one side is a string
 - likewise, when one side is a timestamp and the other a date, both were being
   converted to string; now convert the date to timestamp
 - added an aggBufferWithKeyAttributes to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map)
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - temporary change to use a local cache in ClosureCleaner to avoid cleaning
   closures that can be serialized as-is
 - minor correction to a string in HiveUtils
Reverting non-null aggregate attributes for Min, Max, First, Last since these
depend on the aggregate value being null initially so it can be set during iteration.

Added an "initialValuesForGroup" for aggregate attributes that sets up the
value as the zero of the data type for Sum and Average when creating the initial
aggregation buffer (instead of null)

Renamed "aggregateBufferWithKeyAttribute" to "aggregateBufferAttributeForGroup"

Reverted the optimization hack in ClosureCleaner for now.
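The distinction drawn in the commit above, that Sum and Average can start a group's buffer at a type-appropriate zero while Min, Max, First, Last must start at null, can be sketched as below. This is a simplified stand-in for illustration, not the actual Catalyst expression classes:

```scala
// Hedged sketch of the "initialValuesForGroup" idea: once a group's row
// exists in the hash map at all, Sum/Average can start from zero and stay
// non-null, letting the generated code drop null checks; Min/Max/First/Last
// have no neutral element and must still start as null so the first input
// value can be detected during iteration.
sealed trait AggInit
case object NullInit extends AggInit                  // Min, Max, First, Last
final case class ZeroInit(zero: Any) extends AggInit  // Sum, Average

def initialValueForGroup(fn: String): AggInit = fn match {
  case "sum" | "avg" => ZeroInit(0L) // zero of the buffer's data type
  case _             => NullInit     // value must begin as null
}
```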
@sumwale sumwale merged commit 0313b61 into snappy/branch-2.0 Sep 7, 2016
@sumwale sumwale deleted the SNAP-1034 branch September 7, 2016 15:42
ymahajan pushed a commit that referenced this pull request Jan 13, 2017
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils
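The clear methods added to OpenHashMap/OpenHashSet for reuse can be sketched as below. This mirrors the intent only (reset occupancy without reallocating the backing arrays, so the structure can be reused across dictionary-encoding batches); `ReusableOpenHashSet` is an illustrative toy, not the Spark collection internals:

```scala
import java.util.Arrays

// Hedged sketch: a minimal open-addressing set of longs with a clear()
// that resets state but keeps the allocated arrays, avoiding repeated
// allocation when the structure is reused. Capacity is fixed for brevity.
final class ReusableOpenHashSet(capacity: Int) {
  private val data = new Array[Long](capacity)
  private val occupied = new Array[Boolean](capacity)
  private var count = 0

  def add(v: Long): Unit = {
    var i = ((v % capacity).toInt + capacity) % capacity
    while (occupied(i) && data(i) != v) i = (i + 1) % capacity // linear probing
    if (!occupied(i)) { occupied(i) = true; data(i) = v; count += 1 }
  }

  def contains(v: Long): Boolean = {
    var i = ((v % capacity).toInt + capacity) % capacity
    while (occupied(i)) {
      if (data(i) == v) return true
      i = (i + 1) % capacity
    }
    false
  }

  def size: Int = count

  // The key addition: reuse without reallocating the backing arrays.
  def clear(): Unit = {
    Arrays.fill(occupied, false)
    count = 0
  }
}
```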
sumwale pushed a commit that referenced this pull request Jul 8, 2017
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils
ymahajan pushed a commit that referenced this pull request Feb 22, 2018
 - added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in the generated code for aggregate buffers in HashAggregateExec
   (if the aggregate is on zero rows, there will be no row in the map); an
   accompanying "initialValuesForGroup" was added for initial aggregation buffer values
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - minor correction to a string in HiveUtils

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
ashetkar pushed a commit that referenced this pull request Apr 5, 2018
* Use alpine and java 8 for docker images.

* Remove installation of vim and redundant comment
sumwale pushed a commit to sumwale/spark that referenced this pull request Nov 5, 2020
…oftware#10)

 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - add addLong/longValue methods to SQLMetric that use primitive longs instead of Long objects
sumwale pushed a commit that referenced this pull request Jul 11, 2021
 - use OpenHashMap in DictionaryEncoding, which is faster than the normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - add addLong/longValue methods to SQLMetric that use primitive longs instead of Long objects
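The addLong/longValue point above is about avoiding java.lang.Long boxing on the hot metric-update path. A minimal sketch of the idea, using an illustrative `PrimitiveMetric` class rather than the actual SQLMetric:

```scala
// Hedged sketch: metric updates that take java.lang.Long box (allocate and
// unbox) on every call; methods operating directly on a primitive long
// field avoid that allocation on the hot path.
final class PrimitiveMetric {
  private var value: Long = 0L

  // Boxed path: each call goes through a java.lang.Long object.
  def add(v: java.lang.Long): Unit = value += v.longValue()

  // Primitive path: no object allocation per update.
  def addLong(v: Long): Unit = value += v

  def longValue: Long = value
}
```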