Joinutility seperatorLogin utility separator Infobright.com

Infobright Blog

29
Apr

Hadoop in the ETL Process

CarlGelbart's photo
by CarlGelbart     Thu, Apr 29, 2010

Hadoop is a framework for processing large volumes of data (over 1TB) in parallel across a cluster of servers.  If you have data volumes that overwhelm  ETL tools or shell scripts, this may be for you.  Generators of large amounts of raw data are; call/message detail records for telecoms, web logs for social networking sites, search engines and online retailers.

 

The idea behind this is to have a very fast and scalable way to process transformations on large amounts of raw data into manageable summarized data where it can be loaded into an analytical database like Infobright where ad-hoc data mining and reporting can take place.

 

The way MapReduce works is in the “Map” step, the master node takes the data and chops it up into smaller sets and distributes them to worker nodes. The worker nodes process the smaller sets of data and pass the status of the results back to the master node.  In the “Reduce” step, the master node combines the output of the worker nodes to form the final output that’s ready to load into the database.

 

An ETL chain will often require multiple map/reduce cycles. Joins are sometimes required, and it’s best if one of the files being joined can fit into ram. So you may need a lot of ram and a lot of master nodes to be efficient and achieve the desired throughput.

 

But understanding and coding in MapReduce requires custom programming and becomes an effort to develop, maintain and support. Certainly not as easy as what many of us are used to with ETL tools or shell scripting and stored procedures. 

 

One answer to this is Pig, an open source platform for analyzing large data sets using a high level, easy to use and extensible programming language, Pig Latin. This is a language for expressing a data analysis program in a similar way that SQL is a language for expressing a data analysis query. Many times, just a few Pig Latin statements are all that’s needed to process vast amounts of data. Its compiler generates a series of MapReduce programs for large scale parallel processing. 

 

Another tool that is used for ETL in the Hadoop framework is Hive.  Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools that enable easy ETL, generate summarizations,  a is  mechanism to put structures on the data, and also has the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called Hive QL

 

REFERENCES:

 

Hadoop - http://hadoop.apache.org/

Pig -  http://hadoop.apache.org/pig/

Hive - http://hadoop.apache.org/hive/

 

 

 



Infobright     Tags:

26
Apr

What time is it? (Or, what’s the deal with date dimensions?)

David Lutz's photo
by David Lutz     Mon, Apr 26, 2010

An Infobright evaluator recently asked, "What is the best way of defining date dimension using Infobright?" on our Community forum here http://www.infobright.org/Forums/viewthread/1506/ . In reality, and as you can read in my response, the question of designing a time (or date) dimension is generic to data warehousing.

For over 15 years I have advocated creating the most robust time dimension possible. In my experience, the time dimension is the most heavily elaborated dimension in the data model in order to eliminate as much effort on the database or business intelligence server as possible while processing user queries. Have as many columns of as many variations of the date and its derivatives in as many data types as necessary/possible. It is not unusual to have several 10’s of different explicit representations for each time element.

This is a critical dimensional table and should be as detailed as possible. Your guide should be the queries themselves and what they need. Don’t force any functions or translations onto your tool or queries - perform the function or translation once and save it in the time dimension. That means if one tool or user needs ‘month’ as an integer, save it in an integer column/data type. If another needs the numeric value for ‘month’ represented as a text value, save it again in a CHAR(2) field. It doesn’t matter - in a traditional database or Infobright. There are only 366 days in a year, maximum. For a date dimension with even 100 years’ worth of data, that’s only 36,525 records. And in Infobright, that’s less than one Data Pack (a Data Pack in Infobright contains 65,536 elements) for an egregiously deep time dimension. With compression, it’s nothing in total disk space.

For example, consider the following time dimension DDL as a ** starting point **.

CREATE TABLE dim_time (
date_id            SMALLINT,                      -- surrogate key
date               DATE,                          -- YYYY-MM-DD
date_as_text       VARCHAR(25),                   -- "January 31, 2010"
day_of_week        TINYINT,                       -- 1 through 7
day_of_month       TINYINT,                       -- 1 through 31
day_of_period      TINYINT,                       -- 1 through ...
day_of_quarter     TINYINT,                       -- 1 through 90(ish)
day_of_year        SMALLINT,                      -- 1 through 366
day_of_week_text   VARCHAR(10) COMMENT 'lookup',  -- e.g., "Monday"
week_in_month      TINYINT,                       -- 1 through 5
week_in_period     TINYINT,                       -- 1 through ...
week_in_qtr        TINYINT,                       -- 1 through 13 (or 14)
week_in_year       TINYINT,                       -- 1 through 52 (or 53)
short_day_desc     CHAR(3)     COMMENT 'lookup',  -- e.g., "MON"
long_day_desc      VARCHAR(10) COMMENT 'lookup',  -- e.g., "Monday"
month_num          TINYINT,                       -- 1 through 12
month_num_text     CHAR(2)     COMMENT 'lookup',  -- e.g., "01"
short_month_name   CHAR(3)     COMMENT 'lookup',  -- e.g., "JAN"
long_month_name    VARCHAR(10) COMMENT 'lookup',  -- e.g., "January"
qtr_num            TINYINT,                       -- 1 through 4
qtr_text           CHAR(2)     COMMENT 'lookup',  -- e.g., "01"
period_num         TINYINT,                       -- 1 through ...
period_text        CHAR(1),                       -- "1" through ...
year_num           SMALLINT,                      -- e.g., 2010
year_text          CHAR(4)     COMMENT 'lookup',  -- e.g., "2010"
year_2_digits_num  TINYINT,                       -- e.g., 10
year_2_digits_text CHAR(2)     COMMENT 'lookup',  -- e.g., "10"
holiday            CHAR(1),                       -- "Y" or "N"
national_holiday   CHAR(1),                       -- "Y" or "N"
religious_holiday  CHAR(1),                       -- "Y" or "N"
state_holiday      CHAR(1),                       -- "Y" or "N"
weekend            CHAR(1),                       -- "Y" or "N"
3_day_weekend      CHAR(1),                       -- "Y" or "N"
4_day_weekend      CHAR(1)                        -- "Y" or "N"
-- etc.
) ENGINE=BRIGHTHOUSE
;


And these are just calendar attributes. Fiscal period attributes, business calendar attributes, marketing calendar attributes, etc. all exist and cannot be determined by SQL functions. You must "bake" them into the time dimension. Once. You can create any number of years' worth of time dimension values with data generation tools like those available from vizubi.com or datamelt.com or with a variety of functions in your spreadsheet of choice, such as Microsoft Excel, OpenOffice.org Calc,Lotus 1-2-3. These are just some of the many options avaiable.

With this approach you can return whatever value, string or concatenation you need from the time dimension without adversely affecting performance for the database server or the query/reporting platform.