
Refactor hoodie-hive registration #181

Merged
merged 1 commit into apache:master from hive-sync-refactor on Jun 9, 2017

Conversation

prazanna
Contributor

Refactor to:

  1. Sync the schema from the last commit (for COW, pick the schema from any of the parquet files written by the last commit; for MOR, pick it from the log file or a compacted parquet file, depending on whether the last commit was a compaction or a delta commit).
  2. Get all partitions written to by scanning the commit metadata since the last commit time synced to Hive. List all partitions only if no such metadata is found on the Hive table. (A minimal sketch of both steps follows.)
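For orientation, here is a minimal sketch of the two steps above. The Client interface and every helper name in it (readSchemaFromLastCommit, partitionsWrittenSince, and so on) are hypothetical stand-ins, not this PR's actual API; only the control flow mirrors the description.

import java.util.List;
import java.util.Optional;

/** Illustrative sketch only; helper names are hypothetical, not this PR's API. */
class HiveSyncFlowSketch {

  /** Hypothetical subset of what a Hive client would need to expose. */
  interface Client {
    boolean tableExists();
    String readSchemaFromLastCommit();            // COW: last parquet file; MOR: log file or compacted parquet
    void createTable(String schema);
    Optional<String> getLastCommitTimeSynced();   // marker stored on the Hive table (appears in the diff below)
    List<String> partitionsWrittenSince(String commitTime); // from commit metadata
    List<String> listAllPartitions();             // fallback: full listing
    void updatePartitions(List<String> partitions);
  }

  static void syncHoodieTable(Client client) {
    // Step 1: the schema always comes from the last commit.
    String schema = client.readSchemaFromLastCommit();
    if (!client.tableExists()) {
      client.createTable(schema);
    }

    // Step 2: prefer scanning commit metadata after the last synced commit time;
    // list every partition only when no such marker exists on the table yet.
    Optional<String> lastSynced = client.tableExists()
        ? client.getLastCommitTimeSynced() : Optional.empty();
    List<String> changed = lastSynced.isPresent()
        ? client.partitionsWrittenSince(lastSynced.get())
        : client.listAllPartitions();
    client.updatePartitions(changed);
  }
}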

@prazanna prazanna self-assigned this May 26, 2017
@prazanna prazanna requested a review from vinothchandar May 26, 2017 20:24
@prazanna prazanna mentioned this pull request May 30, 2017
@vinothchandar (Member) left a comment

Looks good. One major concern: syncing merge_on_read to just one table, instead of two.

@@ -21,30 +21,45 @@
import com.beust.jcommander.Parameter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
* Configs needed to sync data into Hive.
*/
Member:

A lot of the files have no license. Can we add it in?

Contributor Author:

The GitHub PR view is doing this fancy thing where it hides the lines; you will find a small blue icon on the left which, when expanded, shows the license. Just checked: all these files have valid licenses in them.

@@ -21,30 +21,45 @@
import com.beust.jcommander.Parameter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
* Configs needed to sync data into Hive.
*/
public class HiveSyncConfig implements Serializable {
Member:

A lot of the files have no copyright?

Contributor Author:

Same thing as above.

* Tool to sync new data from commits, into Hive in terms of
* Tool to sync a hoodie HDFS dataset with a hive metastore table.
* Either use it as a api HiveSyncTool.syncHoodieTable(HiveSyncConfig)
* or as a command line java -cp hoodie-hive.jar HiveSyncTool [args]
Member:

Can you also make sure all the docs are up to date? quickstart, deltastreamer, etc.

Contributor Author:

Good point. I will update the PR with documentation changes after I test this on the staging pipeline.
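Relatedly, the updated javadoc a few comments above names HiveSyncTool.syncHoodieTable(HiveSyncConfig) as the API entry point. Below is a hedged usage sketch: the HiveSyncConfig field names (basePath, databaseName, tableName, jdbcUrl) are assumptions for illustration and may not match the actual fields in this PR, and the static-style call simply follows the wording of that javadoc.

public class HiveSyncUsageSketch {
  public static void main(String[] args) {
    HiveSyncConfig cfg = new HiveSyncConfig();
    cfg.basePath = "hdfs:///tmp/hoodie/trips";      // root of the hoodie dataset (assumed field name)
    cfg.databaseName = "default";                   // target Hive database (assumed field name)
    cfg.tableName = "trips";                        // target Hive table (assumed field name)
    cfg.jdbcUrl = "jdbc:hive2://hiveserver:10000";  // HiveServer2 endpoint (assumed field name)
    // Invocation as written in the javadoc; it may be an instance method in the actual code.
    HiveSyncTool.syncHoodieTable(cfg);
  }
}

The command-line form from the same javadoc (java -cp hoodie-hive.jar HiveSyncTool [args]) presumably takes the equivalent options via JCommander, given the @Parameter import visible in the HiveSyncConfig diff.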

hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
break;
case MERGE_ON_READ:
Member:

For MERGE_ON_READ storage, we need to create both tables: one backed by HoodieInputFormat and one backed by HoodieRealtimeInputFormat. You may want to factor this into the tool CLI also.

Contributor Author:

Yeah, left it as a TODO comment for now. We need to figure out some details before we register the RO table, mostly around major compaction (we only do minor IO-bound compactions). Will take this up separately.
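To make the reviewer's point concrete, a hedged sketch of what registering both views could look like, reusing the createTable shape visible in the diff above. The realtimeTableClient variable and the "<table>_rt" naming are assumptions, not code from this PR; the actual work is deferred to the follow-up ticket (see #193 below).

case MERGE_ON_READ:
  // Read-optimized view: same registration as COPY_ON_WRITE, backed by HoodieInputFormat,
  // so queries see only the compacted parquet files.
  hoodieHiveClient.createTable(schema, HoodieInputFormat.class.getName(),
      MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
  // Real-time view: a second table backed by HoodieRealtimeInputFormat, so queries merge
  // log files with the base files. A second client configured with a "<table>_rt" table
  // name is assumed here purely for illustration.
  realtimeTableClient.createTable(schema, HoodieRealtimeInputFormat.class.getName(),
      MapredParquetOutputFormat.class.getName(), ParquetHiveSerDe.class.getName());
  break;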

// Get the last time we successfully synced partitions
Optional<String> lastCommitTimeSynced = Optional.empty();
if (tableExists) {
lastCommitTimeSynced = hoodieHiveClient.getLastCommitTimeSynced();
Member:

👍 x 💯 . (if only we can make the cleaner also do this & every other meta operation, HDFS can scale a ton more)
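A hedged sketch of how that marker can drive an incremental partition scan. Only getLastCommitTimeSynced appears in the diff above; getPartitionsWrittenToSince, scanAllPartitions, and addPartitionsToTable are illustrative helper names for "walk the commit metadata on the timeline after the marker", "list the whole dataset", and "register the result in Hive".

// Sketch: derive changed partitions from commit metadata written after lastCommitTimeSynced,
// instead of re-listing the whole dataset on HDFS every sync.
List<String> partitionsWritten;
if (lastCommitTimeSynced.isPresent()) {
  // Illustrative: collect the partition paths recorded in each HoodieCommitMetadata
  // newer than the synced marker.
  partitionsWritten = hoodieHiveClient.getPartitionsWrittenToSince(lastCommitTimeSynced);
} else {
  // First sync (or marker missing): fall back to a full partition listing.
  partitionsWritten = hoodieHiveClient.scanAllPartitions();
}
hoodieHiveClient.addPartitionsToTable(partitionsWritten);  // illustrative call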

// read the schema from the log file that the last delta commit wrote
commitMetadata = HoodieCommitMetadata
.fromBytes(activeTimeline.getInstantDetails(lastDeltaCommit).get());
filePath = commitMetadata.getFileIdAndFullPaths().values().stream().filter(s -> s.contains(
Member:
Note to self: we do log the schema with each commit, so we are guaranteed to get a schema from whichever log we read.
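A hedged sketch of turning the file path picked above into a schema. The parquet branch uses standard parquet-hadoop / parquet-avro calls; readSchemaFromLogFile is a hypothetical helper that leans on the guarantee noted here that the schema is logged with every commit.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

abstract class SchemaFromLastCommitSketch {

  // Hypothetical helper: parse the schema out of a hoodie log file (not a real API call here).
  abstract Schema readSchemaFromLogFile(Configuration conf, Path logPath) throws IOException;

  Schema readDataSchema(Configuration conf, String filePath) throws IOException {
    if (filePath.endsWith(".parquet")) {
      // Last commit was a compaction: the schema sits in the footer of the compacted parquet file.
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, new Path(filePath));
      return new AvroSchemaConverter().convert(footer.getFileMetaData().getSchema());
    }
    // Last commit was a delta commit: read the schema written into the log file,
    // which is always present per the note above.
    return readSchemaFromLogFile(conf, new Path(filePath));
  }
}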

@prazanna prazanna force-pushed the hive-sync-refactor branch from 1d04504 to f46a26a on June 8, 2017 18:33
@vinothchandar (Member)

Okay cool. Please open a ticket for tracking double-registering both RO and RT tables. Otherwise gtg. Approve!

@prazanna (Contributor Author) commented Jun 9, 2017

Opened #193 for that.

@prazanna prazanna merged commit db6150c into apache:master Jun 9, 2017
@prazanna prazanna deleted the hive-sync-refactor branch June 9, 2017 20:06