
Refactor hoodie-hive registration #181

Merged
merged 1 commit into apache:master from hive-sync-refactor on Jun 9, 2017

Conversation

prazanna
Contributor

Refactor to:

  1. Sync the schema from the last commit (for COW, pick the schema from any of the parquet files written by the last commit; for MOR, pick it from the log file or a compacted parquet file, depending on whether the last commit was a compaction or a delta commit).
  2. Get all partitions written to by scanning the commit metadata since the last commit time synced to Hive. List all partitions only if no such metadata is found on the Hive table. (A minimal sketch of both steps follows.)
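For orientation, here is a minimal sketch of the two steps above. The Client interface and every helper name in it (readSchemaFromLastCommit, partitionsWrittenSince, and so on) are hypothetical stand-ins, not this PR's actual API; only the control flow mirrors the description.

import java.util.List;
import java.util.Optional;

/** Illustrative sketch only; helper names are hypothetical, not this PR's API. */
class HiveSyncFlowSketch {

  /** Hypothetical subset of what a Hive client would need to expose. */
  interface Client {
    boolean tableExists();
    String readSchemaFromLastCommit();            // COW: last parquet file; MOR: log file or compacted parquet
    void createTable(String schema);
    Optional<String> getLastCommitTimeSynced();   // marker stored on the Hive table (appears in the diff below)
    List<String> partitionsWrittenSince(String commitTime); // from commit metadata
    List<String> listAllPartitions();             // fallback: full listing
    void updatePartitions(List<String> partitions);
  }

  static void syncHoodieTable(Client client) {
    // Step 1: the schema always comes from the last commit.
    String schema = client.readSchemaFromLastCommit();
    if (!client.tableExists()) {
      client.createTable(schema);
    }

    // Step 2: prefer scanning commit metadata after the last synced commit time;
    // list every partition only when no such marker exists on the table yet.
    Optional<String> lastSynced = client.tableExists()
        ? client.getLastCommitTimeSynced() : Optional.empty();
    List<String> changed = lastSynced.isPresent()
        ? client.partitionsWrittenSince(lastSynced.get())
        : client.listAllPartitions();
    client.updatePartitions(changed);
  }
}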

@prazanna prazanna self-assigned this May 26, 2017
@prazanna prazanna requested a review from vinothchandar May 26, 2017 20:24
@prazanna prazanna mentioned this pull request May 30, 2017
@vinothchandar (Member) left a comment

Looks good. One major concern: syncing merge_on_read to just one table, instead of two.

@@ -21,30 +21,45 @@
import com.beust.jcommander.Parameter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
* Configs needed to sync data into Hive.
*/
Member:

A lot of the files have no license. Can we add it in?

Contributor Author:

The GitHub PR view is doing this fancy thing where it hides the lines; you will find a small blue icon on the left which, when expanded, shows the license. Just checked: all these files have valid licenses in them.

@@ -21,30 +21,45 @@
import com.beust.jcommander.Parameter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
* Configs needed to sync data into Hive.
*/
public class HiveSyncConfig implements Serializable {
Member:

A lot of the files have no copyright?

Contributor Author:

Same thing as above.

* Tool to sync new data from commits, into Hive in terms of
* Tool to sync a hoodie HDFS dataset with a hive metastore table.
* Either use it as a api HiveSyncTool.syncHoodieTable(HiveSyncConfig)
* or as a command line java -cp hoodie-hive.jar HiveSyncTool [args]
Member:

Can you also make sure all the docs are up to date? quickstart, deltastreamer, etc.

Contributor Author:

Good point. I will update the PR with documentation changes after I test this on the staging pipeline.
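Relatedly, the updated javadoc a few comments above names HiveSyncTool.syncHoodieTable(HiveSyncConfig) as the API entry point. Below is a hedged usage sketch: the HiveSyncConfig field names (basePath, databaseName, tableName, jdbcUrl) are assumptions for illustration and may not match the actual fields in this PR, and the static-style call simply follows the wording of that javadoc.

public class HiveSyncUsageSketch {
  public static void main(String[] args) {
    HiveSyncConfig cfg = new HiveSyncConfig();
    cfg.basePath = "hdfs:///tmp/hoodie/trips";      // root of the hoodie dataset (assumed field name)
    cfg.databaseName = "default";                   // target Hive database (assumed field name)
    cfg.tableName = "trips";                        // target Hive table (assumed field name)
    cfg.jdbcUrl = "jdbc:hive2://hiveserver:10000";  // HiveServer2 endpoint (assumed field name)
    // Invocation as written in the javadoc; it may be an instance method in the actual code.
    HiveSyncTool.syncHoodieTable(cfg);
  }
}

The command-line form from the same javadoc (java -cp hoodie-hive.jar HiveSyncTool [args]) presumably takes the equivalent options via JCommander, given the @Parameter import visible in the HiveSyncConfig diff.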

hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
break;
case MERGE_ON_READ:
Member:

For MERGE_ON_READ storage, we need to create both tables: one backed by HoodieInputFormat and one backed by HoodieRealtimeInputFormat. You may want to factor this into the tool CLI also.

Contributor Author:

Yeah, left it as a TODO comment for now. We need to figure out some details before we register the RO table, mostly around major compaction (we only do minor IO-bound compactions). Will take this up separately.
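To make the reviewer's point concrete, a hedged sketch of what registering both views could look like, reusing the createTable shape visible in the diff above. The realtimeTableClient variable and the "<table>_rt" naming are assumptions, not code from this PR; the actual work is deferred to the follow-up ticket (see #193 below).

case MERGE_ON_READ:
  // Read-optimized view: same registration as COPY_ON_WRITE, backed by HoodieInputFormat,
  // so queries see only the compacted parquet files.
  hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
      MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
  // Real-time view: a second table backed by HoodieRealtimeInputFormat, so queries merge
  // log files with the base files. A second client configured with a "<table>_rt" table
  // name is assumed here purely for illustration.
  realtimeTableClient.createTable(schema, HoodieRealtimeInputFormat.class.getName(),
      MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
  break;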

// Get the last time we successfully synced partitions
Optional<String> lastCommitTimeSynced = Optional.empty();
if (tableExists) {
lastCommitTimeSynced = hoodieHiveClient.getLastCommitTimeSynced();
Member:

👍 x 💯 . (if only we can make the cleaner also do this & every other meta operation, HDFS can scale a ton more)
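A hedged sketch of how that marker can drive an incremental partition scan. Only getLastCommitTimeSynced appears in the diff above; getPartitionsWrittenToSince, scanAllPartitions, and addPartitionsToTable are illustrative helper names for "walk the commit metadata on the timeline after the marker", "list the whole dataset", and "register the result in Hive".

// Sketch: derive changed partitions from commit metadata written after lastCommitTimeSynced,
// instead of re-listing the whole dataset on HDFS every sync.
List<String> partitionsWritten;
if (lastCommitTimeSynced.isPresent()) {
  // Illustrative: collect the partition paths recorded in each HoodieCommitMetadata
  // newer than the synced marker.
  partitionsWritten = hoodieHiveClient.getPartitionsWrittenToSince(lastCommitTimeSynced);
} else {
  // First sync (or marker missing): fall back to a full partition listing.
  partitionsWritten = hoodieHiveClient.scanAllPartitions();
}
hoodieHiveClient.addPartitionsToTable(partitionsWritten);  // illustrative call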

// read the schema from the log file that the last delta commit wrote
commitMetadata = HoodieCommitMetadata
.fromBytes(activeTimeline.getInstantDetails(lastDeltaCommit).get());
filePath = commitMetadata.getFileIdAndFullPaths().values().stream().filter(s -> s.contains(
Member:
Note to self: we do log the schema with each commit, so we are guaranteed to get a schema from whichever log we read.
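A hedged sketch of turning the file path picked above into a schema. The parquet branch uses standard parquet-hadoop / parquet-avro calls; readSchemaFromLogFile is a hypothetical helper that leans on the guarantee noted here that the schema is logged with every commit.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

abstract class SchemaFromLastCommitSketch {

  // Hypothetical helper: parse the schema out of a hoodie log file (not a real API call here).
  abstract Schema readSchemaFromLogFile(Configuration conf, Path logPath) throws IOException;

  Schema readDataSchema(Configuration conf, String filePath) throws IOException {
    if (filePath.endsWith(".parquet")) {
      // Last commit was a compaction: the schema sits in the footer of the compacted parquet file.
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, new Path(filePath));
      return new AvroSchemaConverter().convert(footer.getFileMetaData().getSchema());
    }
    // Last commit was a delta commit: read the schema written into the log file,
    // which is always present per the note above.
    return readSchemaFromLogFile(conf, new Path(filePath));
  }
}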

@prazanna prazanna force-pushed the hive-sync-refactor branch from 1d04504 to f46a26a on June 8, 2017 18:33
@vinothchandar (Member)

Okay cool. Please open a ticket for tracking double-registering both RO and RT tables. Otherwise gtg. Approve!

@prazanna (Contributor Author) commented Jun 9, 2017

Opened #193 for that.

@prazanna prazanna merged commit db6150c into apache:master Jun 9, 2017
@prazanna prazanna deleted the hive-sync-refactor branch June 9, 2017 20:06