Hadoop Hands on - A POC Covering HDFS API, MapReduce, JSON and AVRO SerDe, HBase API With FuzzyRowFilter usage

I am still learning Hadoop. Along the way I found very few comprehensive POCs that cover more than one or two of the prominent Hadoop technologies, and this POC is meant to fill that gap. After setting up CDH 4.7 on my laptop, I implemented a POC that touches the HDFS API, MapReduce, JSON and Avro SerDe, and HBase.

POC Outline

The input is a JSON data file (an email message inbox) that has to be loaded into HDFS. A MapReduce job must then store the email messages in an HBase table as Avro blobs. The code should be as generic as possible, so that a new field can be added to the JSON with minimal code change. The HBase design must avoid region hot spotting and must support time-range queries.

POC Objectives & Achievements

Put one JSON file in HDFS. The JSON should contain a date somewhere and be hierarchical, at least two levels deep. Write a Java class that reads multiple files from the local file system and copies them to HDFS (as copyFromLocal does).
Achieved:
The JSON data resembles a mail inbox: each JSON record is an email message with details such as from, cc, subject and timestamp.
Design the HBase table and RowKey to avoid region server hot spotting.
Achieved:
Used the HBase shell to create a table with pre-split regions to avoid hot spotting. The table has only one column family, which has only one column qualifier. Each cell value is an Avro blob representing one JSON record (one email message). The RowKey is salted, which together with pre-splitting spreads writes across regions.
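To make the salting idea concrete, here is a minimal Java sketch of one way to build a salted row key and the matching split points. It is not the POC's actual code: the bucket count of 4, the two-digit salt, the 13-digit zero-padded millisecond timestamp, the '|' separator and all class and method names are assumptions chosen for illustration.

```java
import java.util.Locale;

final class SaltedRowKey {
    // Assumption: 4 salt buckets, one per pre-split region.
    static final int BUCKETS = 4;

    /** Derives a stable bucket from the message id, so a message always gets the same salt. */
    static int saltFor(String messageId) {
        return Math.floorMod(messageId.hashCode(), BUCKETS);
    }

    /** Builds "salt|timestamp|messageId" with fixed-width salt and timestamp fields. */
    static String rowKey(long epochMillis, String messageId) {
        return String.format(Locale.ROOT, "%02d|%013d|%s",
                saltFor(messageId), epochMillis, messageId);
    }

    /** Region boundaries for pre-splitting: one split point before every bucket except the first. */
    static String[] splitPoints() {
        String[] splits = new String[BUCKETS - 1];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = String.format(Locale.ROOT, "%02d", i);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(rowKey(1400000000000L, "msg-1"));
        System.out.println(String.join(",", splitPoints()));
    }
}
```

Because the salt comes from the message id rather than the timestamp, messages that arrive close together in time land in different regions. The price is that a time-range read has to look in every bucket.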
Write a MapReduce job that reads the JSON file, creates Avro blobs and puts them into an HBase column. One record in the JSON corresponds to one row in HBase. Each row in HBase should have one blob and one row key.
Achieved:
1. The Avro schema file is loaded into the distributed cache programmatically when the Job Client is configured.
2. A custom splitter supplies one JSON record to each Map instance. The custom splitter is also set when the Job Client is configured.
3. The HBase connection is made programmatically in the Job Client using the HBaseConfiguration API.
4. The map function parses the JSON record to extract the message id and timestamp, then generates a salting prefix to build the row key.
5. The row key design (salting prefix | timestamp | messageid) avoids hot spotting and supports short time-range scans.
6. The Avro API and the Avro schema are used to serialize each JSON record into an Avro blob. The code is generic in the sense that Avro blob generation does not depend on JSON field names, so adding a new field to the JSON later requires no code change.
7. The map emits the RowKey (for the HBase row) as key and the Avro blob as value.
8. The reducer extends TableReducer to insert rows into the HBase table. It creates a Put and writes that Put to the Context.
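Step 2 above depends on cutting the input into whole JSON records. The sketch below shows one possible way to do that in plain Java, by tracking brace depth while skipping over string contents. It is an assumption about the approach, not the POC's splitter: the class and method names are hypothetical, and in a real job this logic would typically sit inside a custom InputFormat and RecordReader, which I leave out so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

final class JsonRecordSplitter {

    /** Returns every top-level JSON object found in the text, one string per record. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        int depth = 0;            // current nesting level of '{'
        int start = -1;           // index where the current record began
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true right after a backslash inside a string

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                // Braces inside strings must not change the depth.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    records.add(text.substring(start, i + 1));
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String inbox = "[{\"id\":\"a\",\"hdr\":{\"subject\":\"hi\"}}, {\"id\":\"b\"}]";
        for (String record : splitRecords(inbox)) {
            System.out.println(record);
        }
    }
}
```

Each returned string is a complete record, nested objects included, so it can be handed to a JSON parser as is.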
Write a Java program that uses the HBase API to read HBase records, converts the column data (blob) into JSON and writes it to a file using the java.io API.
Achieved:
1. Used the FuzzyRowFilter HBase API to read efficiently from HBase records that salting has spread across regions.
2. Used a generic Avro deserializer to convert each Avro blob back into JSON data.
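HBase's FuzzyRowFilter takes pairs of a row key pattern and a mask, where a mask byte of 0 means "this position must match" and 1 means "any byte is fine". The sketch below imitates that matching rule in plain Java to show why it suits salted keys. It is a simplified model, not HBase code: it assumes a row key layout of two-digit salt, '|', 13-digit millisecond timestamp, '|', message id, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class FuzzyKeyMatch {

    /** mask[i] == 0: row[i] must equal pattern[i]; mask[i] == 1: any byte is accepted. */
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        // Only the first pattern.length bytes are checked; the rest of the row is free.
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds {pattern, mask} for "??|<timestampPrefix>", with the two salt bytes as wildcards. */
    static byte[][] patternAndMask(String timestampPrefix) {
        byte[] pattern = ("??|" + timestampPrefix).getBytes(StandardCharsets.UTF_8);
        byte[] mask = new byte[pattern.length]; // all zeros: every position fixed
        mask[0] = 1; // first salt digit: any value
        mask[1] = 1; // second salt digit: any value
        return new byte[][] {pattern, mask};
    }

    public static void main(String[] args) {
        byte[][] pm = patternAndMask("1400000000");
        byte[] row = "03|1400000000123|msg-7".getBytes(StandardCharsets.UTF_8);
        System.out.println(matches(row, pm[0], pm[1]));
    }
}
```

Fixing only the leading digits of the timestamp selects a short time window (the first 10 of 13 digits cover one second), while the wildcard salt lets a single filter match that window in every bucket.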

Technical Highlights

Successful setup of CDH 4.7 on Ubuntu 12.04
Successful implementation of custom splitting, the Distributed Cache, and access to the cached file by its link name
Successful implementation of TableReducer to write rows to the HBase table from the reducer
Generic Avro serialization and deserialization (to and from JSON)
JSON parsing using the Jackson parser
Successful implementation of FuzzyRowFilter to read from a salted, spread-out HBase table
Pre-splitting of the HBase table and successful use of RowKey salting to prevent hot spotting
Successful integration of HBase with an HBase-managed ZooKeeper ensemble through proper configuration in hbase-site.xml

Download Source Code

All code samples for this POC, covering the HDFS API, MapReduce, JSON and Avro SerDe, and the HBase API with FuzzyRowFilter, are available on GitHub.

About the Author

Soumen Chandra
