Improving out of box experience for data source #295
Conversation
vinothchandar commented on Jan 5, 2018
- Fixes #246 (Default Configuration Changes to Hoodie to process large upserts)
- Bump up default parallelism to 1500, to handle large upserts
- Add docs on S3 configuration & tuning tips with tested Spark knobs
- Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
- Improve speed of ROTablePathFilter by removing directory check
- Move to spark-avro 4.0 to handle issue with nested fields with same name
- Keep AvroConversionUtils in sync with spark-avro 4.0
Tested this with up to 400GB of input, shuffling up to 1TB intermediate.
Looks great. Will test with schema involving all avro data-types.
```diff
@@ -42,7 +42,7 @@
 private static final String BASE_PATH_PROP = "hoodie.base.path";
 private static final String AVRO_SCHEMA = "hoodie.avro.schema";
 public static final String TABLE_NAME = "hoodie.table.name";
-private static final String DEFAULT_PARALLELISM = "200";
+private static final String DEFAULT_PARALLELISM = "1500";
```
Is this intentional?
Yes. 200 is too small: Spark partitions need to be less than 2GB, and we'd like to be able to do 500GB upserts in a stable way out of the box.
@n3nash do you see any issues (esp. config default changes) with merging this?
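The 2GB reasoning above can be sanity-checked with quick arithmetic. This is an illustrative sketch, not from the PR; the `min_partitions` helper and the 128 MB per-partition target are my assumptions:

```python
# Illustrative arithmetic (not from the PR): why the old default of 200
# shuffle partitions is too few. Spark caps a single shuffle partition's
# data at 2 GB, so the partition count must keep per-partition data well
# under that limit.

GB = 1024 ** 3

def min_partitions(shuffle_bytes: int, target_bytes: int = 128 * 1024 ** 2) -> int:
    """Partitions needed to keep each shuffle partition near a target size.
    The 128 MB target is an assumption (a common HDFS-block-sized choice)."""
    return -(-shuffle_bytes // target_bytes)  # ceiling division

# A 500 GB upsert with 200 partitions: ~2.5 GB each -> over the 2 GB cap.
print(500 * GB // 200 / GB)              # 2.5

# With the new default of 1500: ~0.33 GB each -> comfortably under the cap.
print(round(500 * GB // 1500 / GB, 2))   # 0.33

# Sizing for ~128 MB partitions would suggest even more:
print(min_partitions(500 * GB))          # 4000
```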
```
spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.memoryFraction 0.2
```
This is deprecated; remove.
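For context (an editorial note, not part of the review): `spark.shuffle.memoryFraction` and `spark.storage.memoryFraction` were deprecated by Spark 1.6's unified memory manager. The rough modern equivalents are sketched below; verify defaults against your Spark version:

```
# spark-defaults.conf sketch (assumed equivalents, not from this PR)
spark.memory.fraction         0.6   # unified execution + storage pool
spark.memory.storageFraction  0.5   # share of that pool protected for storage
```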
```
spark.shuffle.memoryFraction 0.2
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.storage.memoryFraction 0.6
```
Same, remove.
@n3nash you may want to double-check configs to see if we are still setting old props.
Also anything you can add here for reliable spark configs would be appreciated.
```
spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
```
Include GC tuning knobs.
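One plausible set of GC knobs to pair with the logging flags above. This is an illustrative CMS configuration (a minimal sketch, not one prescribed by this PR); the sizes should be tuned to your executor heap:

```
# Illustrative JVM GC knobs (assumed values, adjust per workload)
spark.executor.extraJavaOptions -XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
```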
- hoodie-hadoop-mr now needs objectsize bundled
- Also updated docs with additional tuning tips