[HUDI-6092] Reuse schema objects while deserializing log blocks. #8484

Merged 1 commit on Apr 22, 2023

prashantwason
Member

Change Logs

  1. Added a ConcurrentHashMap in HoodieDataBlock that maps each schema string to its parsed schema object.
  2. In HoodieHFileDataBlock and HoodieAvroDataBlock, use this map to retrieve the schema object rather than parsing the schema every time.
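
The caching idea in the two items above can be sketched as follows. This is a minimal illustration, not the code from this PR: `SchemaCache`, `SchemaCacheDemo`, and the stand-in parser are hypothetical names, and the parser is passed in as a function so the example runs without Avro on the classpath. With Avro, the parser would typically be `s -> new Schema.Parser().parse(s)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper (not Hudi's actual class): caches parsed schema objects by schema string. */
final class SchemaCache<S> {
  private final Map<String, S> cache = new ConcurrentHashMap<>();
  private final Function<String, S> parser;

  SchemaCache(Function<String, S> parser) {
    this.parser = parser;
  }

  /** Returns the cached schema object for this string, parsing it only on the first request. */
  S get(String schemaString) {
    // computeIfAbsent applies the parser at most once per distinct key, even under concurrent access.
    return cache.computeIfAbsent(schemaString, parser);
  }

  int size() {
    return cache.size();
  }
}

final class SchemaCacheDemo {
  public static void main(String[] args) {
    AtomicInteger parseCount = new AtomicInteger();
    // Stand-in parser that only counts invocations; a real one would build a schema object.
    SchemaCache<Object> cache = new SchemaCache<>(s -> {
      parseCount.incrementAndGet();
      return new Object();
    });

    String schema = "{\"type\":\"string\"}";
    Object first = cache.get(schema);
    // Equal content but a different String instance, as when each log block carries its own copy.
    Object second = cache.get(new String(schema));

    System.out.println("same object reused: " + (first == second));
    System.out.println("parser invocations: " + parseCount.get());
  }
}
```

Because the map is keyed by string content, every log block whose header carries the same schema text shares one schema object instead of holding its own parsed copy. One trade-off of this design is that such a cache holds its entries for as long as the map itself is reachable.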

Also introduced some try-with-resources blocks to automatically close resources that were being leaked.

Impact

Reduces memory consumption when reading log files with a very large number of log blocks, because blocks with the same schema string now share one schema object.

Risk level (write none, low medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed


return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord<T>) data);
}
Contributor

Is the iterator still valid once the outer reader has been closed?

Member Author

Probably not. I will back out the changes that introduce the try block.

I think the intention is to return an iterator which will close the reader after iteration (hence the name ClosableIterator).
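
The concern in this thread can be reproduced with a small self-contained sketch. Everything here is hypothetical and is not Hudi code: `FakeReader` stands in for a log block reader, and `ReaderClosingIterator` stands in for an iterator that owns its reader. The first factory method shows why wrapping the reader in try-with-resources invalidates the returned iterator; the second shows the pattern described above, where the iterator closes the reader after iteration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a reader; its iterator fails once the reader is closed. */
final class FakeReader implements AutoCloseable {
  private final List<String> records = Arrays.asList("r1", "r2");
  private boolean closed = false;

  Iterator<String> iterator() {
    Iterator<String> inner = records.iterator();
    return new Iterator<String>() {
      @Override public boolean hasNext() {
        if (closed) throw new IllegalStateException("reader already closed");
        return inner.hasNext();
      }
      @Override public String next() {
        if (closed) throw new IllegalStateException("reader already closed");
        return inner.next();
      }
    };
  }

  boolean isClosed() { return closed; }

  @Override public void close() { closed = true; }
}

/** Hypothetical iterator that owns the reader and closes it when exhausted or closed explicitly. */
final class ReaderClosingIterator implements Iterator<String>, AutoCloseable {
  private final FakeReader reader;
  private final Iterator<String> inner;

  ReaderClosingIterator(FakeReader reader) {
    this.reader = reader;
    this.inner = reader.iterator();
  }

  @Override public boolean hasNext() {
    if (reader.isClosed()) return false;
    boolean more = inner.hasNext();
    if (!more) close(); // release the reader as soon as iteration finishes
    return more;
  }

  @Override public String next() { return inner.next(); }

  boolean isReaderClosed() { return reader.isClosed(); }

  @Override public void close() { reader.close(); }
}

final class IteratorLifetimeDemo {
  // Pitfall: the reader is closed on return, before the caller consumes the iterator.
  static Iterator<String> closedTooEarly() {
    try (FakeReader reader = new FakeReader()) {
      return reader.iterator();
    }
  }

  // Alternative: the iterator owns the reader, so the reader stays open during iteration.
  static ReaderClosingIterator ownedByIterator() {
    return new ReaderClosingIterator(new FakeReader());
  }

  public static void main(String[] args) {
    try {
      closedTooEarly().hasNext();
    } catch (IllegalStateException e) {
      System.out.println("closedTooEarly failed: " + e.getMessage());
    }
    ReaderClosingIterator it = ownedByIterator();
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    System.out.println("reader closed after iteration: " + it.isReaderClosed());
  }
}
```

With the second pattern, a caller that stops iterating early still needs to call `close()` on the iterator, otherwise the reader stays open.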

danny0405 self-assigned this on Apr 18, 2023
danny0405 added the labels code-refactor, writer-core (issues relating to core transactions/write actions), and priority:minor (everything else; usability gaps; questions; feature reqs) on Apr 18, 2023
prashantwason force-pushed the pw_reuse_schema_objects branch from 952ed0d to 04ccb55 on April 20, 2023 06:53
Contributor

danny0405 left a comment

+1

danny0405
Contributor

@hudi-bot run azure

hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

danny0405 merged commit 09a7953 into apache:master on Apr 22, 2023