
Infobright Blog

Hadoop in the ETL Process

by CarlGelbart, Thu, Apr 29, 2010

Hadoop is a framework for processing large volumes of data (over 1TB) in parallel across a cluster of servers. If your data volumes overwhelm ETL tools or shell scripts, it may be for you. Typical generators of large amounts of raw data are call/message detail records for telecoms, and web logs for social networking sites, search engines and online retailers.

 

The idea is to have a very fast and scalable way to transform large amounts of raw data into manageable, summarized data. The summarized data can then be loaded into an analytical database like Infobright, where ad-hoc data mining and reporting can take place.

 

Hadoop does this processing with MapReduce, which works in two steps. In the “Map” step, the master node splits the data into smaller sets and distributes them to worker nodes. The worker nodes process their sets and pass the results back to the master node. In the “Reduce” step, the output of the worker nodes is combined to form the final output that’s ready to load into the database.
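
To make the two steps concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the call records, the function names and the "total seconds per caller" summary are all made up for illustration. On a real cluster the map calls would run in parallel on the worker nodes.

```python
from collections import defaultdict

# Hypothetical raw call detail records: (caller, duration in seconds).
RECORDS = [
    ("alice", 120),
    ("bob", 30),
    ("alice", 60),
    ("carol", 45),
    ("bob", 15),
]


def split(records, n_chunks):
    """Chop the input into n_chunks smaller sets, one per worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: a worker emits one (key, value) pair per record."""
    return [(caller, duration) for caller, duration in chunk]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all workers."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce step: combine the values for each key into one summary row."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records, n_workers=2):
    chunks = split(records, n_workers)
    # Run sequentially here; a cluster would run these on separate workers.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_groups(shuffle(mapped))


summary = run_job(RECORDS)
print(sorted(summary.items()))
# [('alice', 180), ('bob', 45), ('carol', 45)]
```

The summary comes out the same however many workers the input is split across, which is what lets this kind of job scale out.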

 

An ETL chain will often require multiple map/reduce cycles. Joins are sometimes required, and it’s best if one of the files being joined can fit into RAM. So you may need a lot of RAM and a lot of worker nodes to be efficient and achieve the desired throughput.
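
Here is why the in-memory case is the easy one, sketched in plain Python. Again this is only an illustration, not Hadoop code: the customer table, the call records and the function names are hypothetical. The small file is held as a dictionary, and the large file is streamed past it one record at a time, so the large side never has to be sorted or held in memory.

```python
# Hypothetical small lookup file: customer id -> region. Fits in RAM.
CUSTOMERS = {"c1": "east", "c2": "west"}

# Hypothetical large file, read one record at a time: (customer id, seconds).
CALLS = [("c1", 120), ("c2", 30), ("c1", 60), ("c3", 10)]


def join_in_memory(big_records, small_table):
    """Yield (region, duration) by probing the in-memory table per record."""
    for customer_id, duration in big_records:
        region = small_table.get(customer_id)
        if region is not None:  # inner join: unmatched records are dropped
            yield region, duration


def total_by_region(joined):
    """Summarize the joined records into one total per region."""
    totals = {}
    for region, duration in joined:
        totals[region] = totals.get(region, 0) + duration
    return totals


totals = total_by_region(join_in_memory(CALLS, CUSTOMERS))
print(totals)
# {'east': 180, 'west': 30}
```

When neither file fits in memory, both sides have to be grouped by the join key first, which typically costs an extra pass over the data.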

 

But MapReduce requires custom programming, and that code takes effort to develop, maintain and support. It is certainly not as easy as what many of us are used to with ETL tools, shell scripting and stored procedures.

 

One answer to this is Pig, an open source platform for analyzing large data sets using a high-level, easy-to-use and extensible programming language, Pig Latin. Pig Latin is a language for expressing a data analysis program in much the same way that SQL is a language for expressing a data analysis query. Many times, just a few Pig Latin statements are all that’s needed to process vast amounts of data. The Pig compiler generates a series of MapReduce programs for large-scale parallel processing.

 

Another tool used for ETL in the Hadoop framework is Hive. Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools that enable easy ETL and data summarization, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language called Hive QL.

 

REFERENCES:

 

Hadoop - http://hadoop.apache.org/

Pig - http://hadoop.apache.org/pig/

Hive - http://hadoop.apache.org/hive/

 

 

 



Tags: etl, hadoop