Refactor hoodie-hive registration #181
Conversation
Looks good. One major concern over syncing MERGE_ON_READ to just one table, instead of two.
@@ -21,30 +21,45 @@
import com.beust.jcommander.Parameter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
 * Configs needed to sync data into Hive.
 */
A lot of the files have no license. Can we add it in?
GitHub's PR view is doing this fancy thing where it hides the lines; you will find a small blue icon on the left which, when expanded, shows the license. Just checked: all these files have valid licenses in them.
public class HiveSyncConfig implements Serializable {
A lot of the files have no copyright?
Same thing as above
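
For orientation, here is a minimal sketch of the kind of JCommander-backed config class the hunk above touches; the specific @Parameter fields and flag names are illustrative assumptions, not the PR's exact set.

    import com.beust.jcommander.Parameter;

    import java.io.Serializable;

    /**
     * Illustrative sketch only: configs needed to sync data into Hive.
     */
    public class HiveSyncConfig implements Serializable {

      // Field and flag names below are assumptions for illustration.
      @Parameter(names = {"--database"}, description = "Hive database to sync into", required = true)
      public String databaseName;

      @Parameter(names = {"--table"}, description = "Hive table to sync into", required = true)
      public String tableName;

      @Parameter(names = {"--base-path"}, description = "Base path of the hoodie dataset", required = true)
      public String basePath;

      @Parameter(names = {"--jdbc-url"}, description = "Hive JDBC connection URL")
      public String jdbcUrl = "jdbc:hive2://localhost:10000";
    }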
- * Tool to sync new data from commits, into Hive in terms of
+ * Tool to sync a hoodie HDFS dataset with a hive metastore table.
+ * Either use it as a api HiveSyncTool.syncHoodieTable(HiveSyncConfig)
+ * or as a command line java -cp hoodie-hive.jar HiveSyncTool [args]
Can you also make sure all the docs are up to date? Quickstart, deltastreamer, etc.
Good point. I will update the PR with documentation changes after I test this on the staging pipeline.
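
To make the Javadoc's API mode concrete, a hedged usage sketch: the call form mirrors HiveSyncTool.syncHoodieTable(HiveSyncConfig) as named in the Javadoc above, while the config field names are assumptions.

    // Illustrative only; field names are assumptions.
    HiveSyncConfig cfg = new HiveSyncConfig();
    cfg.databaseName = "default";                 // assumed field
    cfg.tableName = "sample_table";               // assumed field
    cfg.basePath = "hdfs:///data/sample_table";   // assumed field
    HiveSyncTool.syncHoodieTable(cfg);            // form taken from the Javadoc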
hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
    MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
break;
case MERGE_ON_READ:
For MERGE_ON_READ storage, we need to create both tables: one backed by HoodieInputFormat and one backed by HoodieRealtimeInputFormat. You may want to factor this into the tool CLI also.
Yeah, left it as a TODO comment for now. We need to figure out some details before we register the RO table, mostly around major compaction (we only do minor, IO-bound compactions). Will take this up separately.
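
For reference, a hypothetical sketch of what the MERGE_ON_READ branch could look like once both tables are registered, mirroring the COPY_ON_WRITE snippet above. HoodieRealtimeInputFormat comes from the review comment; how the two tables are named apart (for example an "_rt" suffix) is left open by this PR.

    case MERGE_ON_READ:
      // Read-optimized (RO) table: serves only compacted columnar data.
      hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
          MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
      // Realtime (RT) table: merges base files with log files at query time.
      hoodieHiveClient.createTable(schema, HoodieRealtimeInputFormat.class.getName(),
          MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
      break;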
// Get the last time we successfully synced partitions
Optional<String> lastCommitTimeSynced = Optional.empty();
if (tableExists) {
  lastCommitTimeSynced = hoodieHiveClient.getLastCommitTimeSynced();
👍 x 💯. (If only we could make the cleaner, and every other meta operation, do this too; HDFS could scale a ton more.)
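
A minimal sketch of the incremental flow this checkpoint enables, assuming helper names (getPartitionsWrittenToSince, scanAllPartitions) that may differ from the PR's actual code:

    List<String> writtenPartitions;
    if (lastCommitTimeSynced.isPresent()) {
      // Scan only commits newer than the checkpoint stored with the Hive table.
      writtenPartitions = hoodieHiveClient.getPartitionsWrittenToSince(lastCommitTimeSynced);
    } else {
      // First sync: fall back to a full listing of the dataset.
      writtenPartitions = hoodieHiveClient.scanAllPartitions();
    }
    // Register just these partitions, then advance the checkpoint to the latest
    // commit time so the next run stays incremental.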
// read the schema from the log file the last delta commit wrote
commitMetadata = HoodieCommitMetadata
    .fromBytes(activeTimeline.getInstantDetails(lastDeltaCommit).get());
filePath = commitMetadata.getFileIdAndFullPaths().values().stream().filter(s -> s.contains(
Note to self: we do log the schema with each commit, so we are guaranteed to get a schema from whichever log we read.
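
A hypothetical completion of the truncated snippet above (not the PR's actual code), just to show why that guarantee matters: any file the last delta commit wrote can supply the schema. The ".log" extension filter, the readSchemaFromLogFile helper, and the MessageType return type are all assumptions.

    String anyFile = commitMetadata.getFileIdAndFullPaths().values().stream()
        .filter(path -> path.contains(".log"))          // assumed extension filter
        .findFirst()
        .orElseThrow(() -> new IllegalStateException(
            "No file found for delta commit " + lastDeltaCommit));
    MessageType schema = readSchemaFromLogFile(anyFile); // assumed helper and type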
(Branch updated from 1d04504 to f46a26a.)
Okay cool. Please open a ticket for tracking double-registering both the RO and RT tables. Otherwise good to go. Approved!
Opened #193 for that.