-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[WIP] Fix for issue #333 #386
Conversation
// entries with the same exact in memory copy of the HoodieRecord and the 2 separate filenames that the | ||
// record is found in. This will result in setting currentLocation 2 times and it will fail the second time. | ||
// So creating a new in memory copy of the hoodie record. | ||
HoodieRecord<T> record = new HoodieRecord<>(v1._1()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking if the currentLocation is set, I think creating the record every time is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the implication of doing this ? Will this trigger a huge number of objects being created ?
boolean copyOldRecord = true; | ||
if (keyToNewRecords.containsKey(key)) { | ||
// If we have duplicate records that we are updating, then the hoodie record will be deflated after | ||
// writing the first record. So make a copy of the record to be merged | ||
HoodieRecord<T> hoodieRecord = new HoodieRecord<>(keyToNewRecords.get(key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test did not catch this, as HoodieMergeHandle uses SpillableMap and the get here uses the DiskBasedMap and hence returns a new HoodieRecord everytime. If the get, gets the value from InMemoryMap, then there will be an issue when there are duplicate records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of code is very specific to this use-case and will make re-factoring of the code tricky, also again what is the garbage collection implication of this ?
* Create new hoodie record from an existing record. | ||
* @param record | ||
*/ | ||
public HoodieRecord(HoodieRecord<T> record) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chain constructors here. this(key, data) and then set the locations.
@suniluber please squash your commits into a single one, after addressing @n3nash 's comments.. |
@suniluber Any updates on this ? |
29d3579
to
beaff40
Compare
beaff40
to
70b53cf
Compare
@n3nash @vinothchandar @ovj @bvaradar @jianxu I ingested about 7.5 million records in a single parquet file of size 1G |
@suniluber Thanks for the numbers. From your test, it seems like we tested 20K updates in HoodieMergeHandle and 7.5 million inserts. For us to be sure that the GC time isn't significant, we need to perform this test for a high number of updates (since that is where the keyToRecordsMap() will be large and lead to more new HoodieRecords being created). |
Closing due to inactivity |
#333