We use the HBase export utility to take a daily dump (backup) of our HBase tables. The dump is then copied to S3 so that it is available to the data science (DS) team. The DS team works with a subset of the data, and that subset is not defined by a time range, so we need to read the dump and split it into segregated data sets that the DS team can use. This segregation could be done with Spark, but as a startup we can’t afford to keep a Spark job running continuously just for data ingestion. We need something standalone that runs on spot instances and does not depend on Spark.
The sequence files are generated by the HBase export utility, invoked as follows:
hbase org.apache.hadoop.hbase.mapreduce.Driver export <tablename> <destination-dir>
The code below was tested with sequence files generated on HBase 1.3.x and 1.4.x, using the following dependencies:
compile group: 'org.apache.hbase', name: 'hbase-client', version: '1.4.2'
compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.4.0'
compile group: 'org.apache.hadoop', name: 'hadoop-core', version: '1.2.1'
compile group: 'org.apache.hbase', name: 'hbase-mapreduce', version: '2.1.0'
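import java.io.IOException;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.serializer.WritableSerialization;

// Register HBase's Result serialization alongside the defaults so that
// the reader can deserialize the values stored in the exported file.
Configuration conf = new Configuration();
conf.setStrings(CommonConfigurationKeysPublic.IO_SERIALIZATIONS_KEY,
        conf.get(CommonConfigurationKeysPublic.IO_SERIALIZATIONS_KEY),
        ResultSerialization.class.getName(),
        WritableSerialization.class.getName());
Path inputPath = new Path("./hbase-sequence-file-reader/src/main/resources/sequencefiles/part-m-00000");
// try-with-resources closes the reader even if reading fails part-way.
try (SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), inputPath, conf)) {
    WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
    Result result = null;
    while (reader.next(key)) {
        // The key is the row key; copyBytes() returns only its valid bytes.
        String skey = Bytes.toString(((ImmutableBytesWritable) key).copyBytes());
        result = (Result) reader.getCurrentValue(result);
        // Qualifier -> value map for column family "d"; can be null when the Result is empty.
        NavigableMap<byte[], byte[]> resultMap = result.getFamilyMap(Bytes.toBytes("d"));
        System.out.println(skey);
        if (resultMap != null) {
            resultMap.forEach((k, v) -> System.out.println(Bytes.toString(k) + " " + Bytes.toString(v)));
        }
    }
} catch (IOException | IllegalAccessException | InstantiationException e) {
    e.printStackTrace();
}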
java -jar hbase-sequence-file-reader-1.0-SNAPSHOT.jar
1
address.city hyderabad
address.pincode 500081
age 27
name Chetana
2
address.city hyderabad
address.pincode 500084
age 25
name Nilesh
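Once rows are decoded like this, the segregation step described at the start can be a plain in-memory grouping. The following is only a minimal sketch under assumptions: each decoded row is held as a map of qualifier to value (as printed above), and the subset is chosen by one column, here `address.pincode`. The class `RowSegregator`, its methods, and the choice of grouping column are hypothetical and not part of the original reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
public class RowSegregator {

    // Groups row keys by the value of one column. Rows that lack the column
    // are collected under the bucket name "unknown" (an arbitrary choice).
    public static Map<String, List<String>> groupRowKeysBy(
            Map<String, Map<String, String>> rows, String column) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            String bucket = row.getValue().getOrDefault(column, "unknown");
            groups.computeIfAbsent(bucket, b -> new ArrayList<>()).add(row.getKey());
        }
        return groups;
    }

    // Builds one decoded row from alternating qualifier/value arguments.
    private static Map<String, String> row(String... qualifiersAndValues) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i + 1 < qualifiersAndValues.length; i += 2) {
            columns.put(qualifiersAndValues[i], qualifiersAndValues[i + 1]);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Sample rows mirroring the output shown above.
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("1", row("address.city", "hyderabad", "address.pincode", "500081"));
        rows.put("2", row("address.city", "hyderabad", "address.pincode", "500084"));
        System.out.println(groupRowKeysBy(rows, "address.pincode"));
        // prints {500081=[1], 500084=[2]}
    }
}
```

In a real run the row maps would be filled inside the reader loop instead of being hard-coded, and each bucket could be written to its own output location for the DS team.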