I am still learning Hadoop. Along the way I found a real lack of comprehensive POCs (proofs of concept) that cover at least a few prominent Hadoop technologies, and this POC is meant to fill that gap. After setting up CDH 4.7 on my laptop, I implemented a POC that touches the HDFS API, MapReduce, JSON and Avro SerDe, and HBase.
The input is a JSON data file (an email inbox) that has to be loaded into HDFS; the email messages must then be stored in an HBase table as Avro blobs by a MapReduce job. The code should be as generic as possible, so that a new field can be added to the JSON with minimal code changes. The HBase design must avoid region hot spotting and must support time-range queries.
POC Objectives & Achievements
- Put one JSON file in HDFS. The JSON should contain a date somewhere and be hierarchical, at least two levels deep. Write a Java class that reads multiple files from the local file system and copies them to HDFS (as copyFromLocal does).
The JSON data resembles a mail inbox: each JSON record is an email message with details such as from, cc, subject and timestamp.
- Design the HBase table and RowKey to avoid region server hot spotting.
I used the HBase shell to create a pre-split table, so that writes are spread across regions from the start. The table has a single column family with a single column qualifier, and each cell value is an Avro blob representing one JSON record (one email message). The RowKey is salted, which, together with pre-splitting, avoids hot spotting.
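The salting and pre-splitting idea can be sketched with the standard library alone. This is a minimal illustration, not the POC's actual code: the class name `SaltedRowKey`, the choice of 8 salt buckets, the `|` separator and the 13-digit zero-padded epoch-millisecond timestamp are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stdlib-only sketch of a salted row key laid out as salt|timestamp|messageId. */
final class SaltedRowKey {
    // Assumption: 8 salt buckets; the number used in the POC is not stated.
    static final int BUCKETS = 8;

    /** Derives a stable salt in [0, BUCKETS) from the message id. */
    static int salt(String messageId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(messageId.hashCode(), BUCKETS);
    }

    /**
     * Builds salt|timestamp|messageId. The timestamp is zero-padded to 13 digits
     * so that keys sort chronologically inside each salt bucket.
     */
    static String build(long epochMillis, String messageId) {
        return String.format(Locale.ROOT, "%d|%013d|%s",
                salt(messageId), epochMillis, messageId);
    }

    /** Split points for pre-splitting: one boundary per salt value after the first. */
    static List<String> splitPoints() {
        List<String> splits = new ArrayList<>();
        for (int bucket = 1; bucket < BUCKETS; bucket++) {
            splits.add(Integer.toString(bucket));
        }
        return splits;
    }

    /** Row keys are stored as bytes; ASCII-only content is assumed here. */
    static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build(1400000000000L, "msg-001"));
        System.out.println(splitPoints());
    }
}
```

Under these assumptions the seven split points give eight regions, one per salt value, and could be passed to the HBase shell along the lines of `create 'emails', 'cf', SPLITS => ['1', '2', '3', '4', '5', '6', '7']`, where the table and column family names are hypothetical. Deriving the salt from the message id rather than from the timestamp means that messages arriving at the same moment still land in different regions.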
- Write a MapReduce job that reads the JSON file, creates Avro blobs and puts them in the HBase column. One JSON record corresponds to one HBase row, and each row has one blob and one row key.
- The Avro schema file is loaded into the distributed cache programmatically when the job client is configured.
- A custom splitter supplies one JSON record to each Map instance. It is set when the job client is configured.
- The HBase connection is made programmatically in the job client using the HBaseConfiguration API.
- The map function parses the JSON record to extract the message id and timestamp, then generates a salt prefix to build the row key.
- The row key design (salt prefix | timestamp | message id) avoids hot spotting and supports short time-range scans.
- The Avro API and the Avro schema are used to serialize each JSON record into an Avro blob. The code is generic in the sense that Avro blob generation does not depend on JSON field names, so adding a new field to the JSON requires no code change.
- The map emits the RowKey (for the HBase row) as key and the Avro blob as value.
- The reducer uses TableReducer to insert rows into the HBase table: it creates a Put and writes it to the Context.
- Write a Java program that uses the HBase API to read HBase records, converts the column data (blob) back into JSON and writes it to a file using the java.io API.
- The FuzzyRowFilter HBase API is used to read efficiently from records that salting has spread across regions.
- A generic Avro deserializer converts the Avro blob back into JSON data.
- Successful setup of CDH 4.7 on Ubuntu 12.04
- Successful implementation of custom splitting and the distributed cache, accessing the cached file by link name
- Successful implementation of TableReducer to write rows to the HBase table from the reducer
- Generic Avro (to/from JSON) serialization and deserialization
- JSON parsing using the Jackson parser
- Successful implementation of FuzzyRowFilter to read from the salted, spread-out HBase table
- Pre-splitting of the HBase table and successful use of RowKey salting to prevent hot spotting
- Successful integration of HBase with an HBase-managed ZooKeeper ensemble through proper configuration in hbase-site.xml
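To illustrate how a FuzzyRowFilter can read a time window from a salted table, here is a stdlib-only sketch of the pattern and mask it needs. In HBase's FuzzyRowFilter, a mask byte of 0 marks a position that must match and 1 marks a position that may hold any value. The class `FuzzyPattern` and its `matches` method are hypothetical and only simulate those semantics locally; the one-character salt, the `|` separator and the 13-digit zero-padded millisecond timestamp are assumptions, not details confirmed by the POC.

```java
import java.nio.charset.StandardCharsets;

/** Stdlib-only sketch of a fuzzy pattern for row keys laid out as salt|timestamp|messageId. */
final class FuzzyPattern {
    final byte[] key;   // bytes to compare against
    final byte[] mask;  // 0 = position must match key, 1 = position may hold anything

    private FuzzyPattern(byte[] key, byte[] mask) {
        this.key = key;
        this.mask = mask;
    }

    /**
     * Builds a pattern that ignores the one-character salt and fixes the leading
     * digits of the timestamp. Assumes a one-character salt followed by '|'.
     */
    static FuzzyPattern forTimestampPrefix(String timestampPrefix) {
        byte[] ts = timestampPrefix.getBytes(StandardCharsets.US_ASCII);
        byte[] key = new byte[2 + ts.length];
        byte[] mask = new byte[key.length];  // all zeros: every position fixed by default
        mask[0] = 1;                         // salt position: any value is accepted
        key[1] = (byte) '|';                 // separator: fixed
        System.arraycopy(ts, 0, key, 2, ts.length);  // timestamp digits: fixed
        return new FuzzyPattern(key, mask);
    }

    /** Local simulation of the mask semantics, applied to the leading bytes of a row key. */
    boolean matches(byte[] rowKey) {
        if (rowKey.length < key.length) {
            return false;
        }
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && rowKey[i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FuzzyPattern p = forTimestampPrefix("14000");
        // With HBase on the classpath, the two arrays would be handed over roughly as:
        // new FuzzyRowFilter(Arrays.asList(new Pair<byte[], byte[]>(p.key, p.mask)))
        byte[] row = "3|1400000000000|msg-001".getBytes(StandardCharsets.US_ASCII);
        System.out.println(p.matches(row));
    }
}
```

With a 13-digit millisecond timestamp, fixing the first five digits selects a window of 10^8 ms, roughly 27.8 hours, across every salt bucket in a single scan. A fixed digit prefix only covers decimal-aligned windows, so an arbitrary time range would need several patterns or one range scan per salt value.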
All code samples of the POC, covering the HDFS API, MapReduce, JSON and Avro SerDe, and the HBase API with FuzzyRowFilter usage, are available on GitHub.