
[WIP] Fix for issue #333 #386

Closed — wanted to merge 1 commit.

Conversation

Opened by @suniluber (Contributor). The diff under review:

// entries with the same exact in memory copy of the HoodieRecord and the 2 separate filenames that the
// record is found in. This will result in setting currentLocation 2 times and it will fail the second time.
// So creating a new in memory copy of the hoodie record.
HoodieRecord<T> record = new HoodieRecord<>(v1._1());
@suniluber (Author) commented:
Instead of checking if the currentLocation is set, I think creating the record every time is cleaner.
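To illustrate why creating a fresh copy is needed here, the following is a minimal sketch with simplified, hypothetical classes (not Hudi's actual API): a record that, like HoodieRecord, refuses to have its currentLocation set twice. Tagging one shared in-memory instance with two filenames fails on the second set; copying per (record, file) pair avoids it.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for HoodieRecord: location may be set only once.
class Record {
    final String key;
    String currentLocation;  // file the record was found in

    Record(String key) { this.key = key; }
    Record(Record other) { this.key = other.key; }  // fresh copy, location unset

    void setCurrentLocation(String fileName) {
        if (currentLocation != null) {
            throw new IllegalStateException("location already set for " + key);
        }
        currentLocation = fileName;
    }
}

public class CopyPerLocation {
    // Tag the same logical record for several files by copying it each time,
    // as the PR does, instead of reusing one shared instance.
    public static List<Record> tag(Record shared, List<String> files) {
        List<Record> tagged = new ArrayList<>();
        for (String file : files) {
            Record copy = new Record(shared);  // new in-memory copy each time
            copy.setCurrentLocation(file);
            tagged.add(copy);
        }
        return tagged;
    }

    public static void main(String[] args) {
        Record shared = new Record("key1");
        List<Record> tagged = tag(shared, List.of("file1.parquet", "file2.parquet"));
        System.out.println(tagged.get(0).currentLocation + "," + tagged.get(1).currentLocation);
        // prints "file1.parquet,file2.parquet"; calling
        // shared.setCurrentLocation twice directly would have thrown.
    }
}
```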

A reviewer (Contributor) replied:
What is the implication of doing this? Will this trigger a huge number of objects being created?

boolean copyOldRecord = true;
if (keyToNewRecords.containsKey(key)) {
// If we have duplicate records that we are updating, then the hoodie record will be deflated after
// writing the first record. So make a copy of the record to be merged
HoodieRecord<T> hoodieRecord = new HoodieRecord<>(keyToNewRecords.get(key));
@suniluber (Author) commented:

The test did not catch this, as HoodieMergeHandle uses SpillableMap and the get here hits the DiskBasedMap, which returns a new HoodieRecord every time. If the get retrieves the value from the in-memory map instead, there will be an issue when there are duplicate records.
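The aliasing effect described above can be sketched with simplified, hypothetical stores (not Hudi's actual SpillableMap/DiskBasedMap classes): an in-memory map hands back the same instance on every get, so mutating ("deflating") the returned object corrupts later reads, while a disk-backed map deserializes on each get and so masks the bug.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory store: get() returns the same instance every time.
class InMemoryStore {
    private final Map<String, int[]> map = new HashMap<>();
    void put(String k, int[] v) { map.put(k, v); }
    int[] get(String k) { return map.get(k); }            // shared instance
}

// Disk-like store: clone() stands in for serialize/deserialize,
// so get() returns a fresh object every time.
class DiskLikeStore {
    private final Map<String, int[]> map = new HashMap<>();
    void put(String k, int[] v) { map.put(k, v.clone()); }
    int[] get(String k) { return map.get(k).clone(); }    // fresh copy
}

public class AliasingDemo {
    public static void main(String[] args) {
        int[] payload = {42};
        InMemoryStore mem = new InMemoryStore();
        DiskLikeStore disk = new DiskLikeStore();
        mem.put("k", payload);
        disk.put("k", payload);

        // Mutating the object returned by the in-memory map corrupts
        // the next read; the disk-backed map is unaffected, which is
        // why a test running against the disk path would pass.
        mem.get("k")[0] = 0;
        System.out.println(mem.get("k")[0] + " " + disk.get("k")[0]);
        // prints "0 42"
    }
}
```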

The reviewer (Contributor) replied:

This piece of code is very specific to this use case and will make refactoring tricky. Also, again, what is the garbage-collection implication of this?

/**
 * Create a new hoodie record from an existing record.
 * @param record the record to copy
 */
public HoodieRecord(HoodieRecord<T> record) {
A reviewer (Contributor) commented on this constructor:

Chain constructors here. this(key, data) and then set the locations.
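The reviewer's suggestion can be sketched as follows. Field and accessor names here are illustrative, not Hudi's exact ones: the copy constructor delegates to the primary this(key, data) constructor and then copies the location fields, so initialization logic lives in one place.

```java
// Hypothetical, simplified version of the record class for illustration.
public class HoodieRecordSketch<T> {
    private final String key;
    private final T data;
    private String currentLocation;
    private String newLocation;

    // Primary constructor: all initialization logic lives here.
    public HoodieRecordSketch(String key, T data) {
        this.key = key;
        this.data = data;
    }

    // Copy constructor, chained as the reviewer suggests:
    // delegate to this(key, data), then set the locations.
    public HoodieRecordSketch(HoodieRecordSketch<T> record) {
        this(record.key, record.data);
        this.currentLocation = record.currentLocation;
        this.newLocation = record.newLocation;
    }

    public String getKey() { return key; }
    public T getData() { return data; }
    public String getCurrentLocation() { return currentLocation; }
    public void setCurrentLocation(String loc) { this.currentLocation = loc; }
}
```

Chaining keeps the two constructors from drifting apart if a new field is added later: only the primary constructor needs updating.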

@vinothchandar (Member) commented:

@suniluber please squash your commits into a single one, after addressing @n3nash's comments.

@n3nash (Contributor) commented on May 15, 2018:

@suniluber Any updates on this?

@suniluber force-pushed the updatemergehandle branch from 29d3579 to beaff40 (May 15, 2018 23:06)
@suniluber force-pushed the updatemergehandle branch from beaff40 to 70b53cf (May 15, 2018 23:52)
@suniluber (Author) commented on May 15, 2018:

@n3nash @vinothchandar @ovj @bvaradar @jianxu
I tested the changes for GC issues.

I ingested about 7.5 million records into a single parquet file of size 1 GB.
I then generated 20K deletion messages and about 5 million new messages and ingested them.
This updated the 20K records, created a new 1 GB file, and created a second file for the inserts. I did not observe any significant change in GC times.

@n3nash (Contributor) commented on May 16, 2018:

@suniluber Thanks for the numbers. From your test, it seems we tested 20K updates in HoodieMergeHandle and 7.5 million inserts. To be sure the GC time isn't significant, we need to perform this test with a high number of updates, since that is where the keyToRecordsMap() will be large and lead to more new HoodieRecords being created.
The same issue applies to the BloomIndexLookup step of tagLocationBacktoRecords(): the high GC will show up when the number of record updates is large (leading to many records needing tagging) and hence a high number of new HoodieRecords being created.
Can you perform this test with a high number of updates? Also, to concretely say there is no GC time increase, please run this with the same batch of data and add the GC numbers here both with and without your change, so we know what % the GC increase is, even if negligible. Thanks.
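One way to collect the "with vs. without change" GC numbers requested above is the standard java.lang.management API, sketched below. The workload in main is a stand-in for the ingest run, not Hudi code; in a real comparison the same batch would be driven through both builds and the reported deltas compared.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimer {
    // Sum cumulative GC time (ms) across all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();

        // Stand-in for the ingest workload: allocate many short-lived
        // objects, analogous to creating new HoodieRecord copies per update.
        StringBuilder sink = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            String copy = "record-" + i;
            if (copy.length() > sink.length()) {
                sink.setLength(0);
                sink.append(copy);
            }
        }

        long after = totalGcMillis();
        System.out.println("GC time delta (ms): " + (after - before));
    }
}
```

Running the same batch against both builds and diffing the reported deltas gives the percentage increase the reviewer asks for; GC log flags (e.g. HotSpot's -Xlog:gc) are an alternative when changing the job code is not an option.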

@vinothchandar changed the title from "Fix for issue #333" to "[WIP] Fix for issue #333" on Sep 28, 2018.
@vinothchandar (Member) commented:
Closing due to inactivity
