Hadoop is a framework for processing large volumes of data (over 1TB) in parallel across a cluster of servers. If you have data volumes that overwhelm ETL tools or shell scripts, this may be for you. Generators of large amounts of raw data are; call/message detail records for telecoms, web logs for social networking sites, search engines and online retailers.
The idea behind this is to have a very fast and scalable way to process transformations on large amounts of raw data into manageable summarized data where it can be loaded into an analytical database like Infobright where ad-hoc data mining and reporting can take place.
The way MapReduce works is in the “Map” step, the master node takes the data and chops it up into smaller sets and distributes them to worker nodes. The worker nodes process the smaller sets of data and pass the status of the results back to the master node. In the “Reduce” step, the master node combines the output of the worker nodes to form the final output that’s ready to load into the database.
An ETL chain will often require multiple map/reduce cycles. Joins are sometimes required, and it’s best if one of the files being joined can fit into ram. So you may need a lot of ram and a lot of master nodes to be efficient and achieve the desired throughput.
But understanding and coding in MapReduce requires custom programming and becomes an effort to develop, maintain and support. Certainly not as easy as what many of us are used to with ETL tools or shell scripting and stored procedures.
One answer to this is Pig, an open source platform for analyzing large data sets using a high level, easy to use and extensible programming language, Pig Latin. This is a language for expressing a data analysis program in a similar way that SQL is a language for expressing a data analysis query. Many times, just a few Pig Latin statements are all that’s needed to process vast amounts of data. Its compiler generates a series of MapReduce programs for large scale parallel processing.
Another tool that is used for ETL in the Hadoop framework is Hive. Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools that enable easy ETL, generate summarizations, a is mechanism to put structures on the data, and also has the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called Hive QL
REFERENCES:
Hadoop - http://hadoop.apache.org/
Pig - http://hadoop.apache.org/pig/
Hive - http://hadoop.apache.org/hive/
Comments (0)
An Infobright evaluator recently asked, "What is the best way of defining date dimension using Infobright?" on our Community forum here http://www.infobright.org/Forums/viewthread/1506/ . In reality, and as you can read in my response, the question of designing a time (or date) dimension is generic to data warehousing.
For over 15 years I have advocated creating the most robust time dimension possible. In my experience, the time dimension is the most heavily elaborated dimension in the data model in order to eliminate as much effort on the database or business intelligence server as possible while processing user queries. Have as many columns of as many variations of the date and its derivatives in as many data types as necessary/possible. It is not unusual to have several 10’s of different explicit representations for each time element.
This is a critical dimensional table and should be as detailed as possible. Your guide should be the queries themselves and what they need. Don’t force any functions or translations onto your tool or queries - perform the function or translation once and save it in the time dimension. That means if one tool or user needs ‘month’ as an integer, save it in an integer column/data type. If another needs the numeric value for ‘month’ represented as a text value, save it again in a CHAR(2) field. It doesn’t matter - in a traditional database or Infobright. There are only 366 days in a year, maximum. For a date dimension with even 100 years’ worth of data, that’s only 36,525 records. And in Infobright, that’s less than one Data Pack (a Data Pack in Infobright contains 65,536 elements) for an egregiously deep time dimension. With compression, it’s nothing in total disk space.
For example, consider the following time dimension DDL as a ** starting point **.
CREATE TABLE dim_time (
date_id SMALLINT, -- surrogate key
date DATE, -- YYYY-MM-DD
date_as_text VARCHAR(25), -- "January 31, 2010"
day_of_week TINYINT, -- 1 through 7
day_of_month TINYINT, -- 1 through 31
day_of_period TINYINT, -- 1 through ...
day_of_quarter TINYINT, -- 1 through 90(ish)
day_of_year SMALLINT, -- 1 through 366
day_of_week_text VARCHAR(10) COMMENT 'lookup', -- e.g., "Monday"
week_in_month TINYINT, -- 1 through 5
week_in_period TINYINT, -- 1 through ...
week_in_qtr TINYINT, -- 1 through 13 (or 14)
week_in_year TINYINT, -- 1 through 52 (or 53)
short_day_desc CHAR(3) COMMENT 'lookup', -- e.g., "MON"
long_day_desc VARCHAR(10) COMMENT 'lookup', -- e.g., "Monday"
month_num TINYINT, -- 1 through 12
month_num_text CHAR(2) COMMENT 'lookup', -- e.g., "01"
short_month_name CHAR(3) COMMENT 'lookup', -- e.g., "JAN"
long_month_name VARCHAR(10) COMMENT 'lookup', -- e.g., "January"
qtr_num TINYINT, -- 1 through 4
qtr_text CHAR(2) COMMENT 'lookup', -- e.g., "01"
period_num TINYINT, -- 1 through ...
period_text CHAR(1), -- "1" through ...
year_num SMALLINT, -- e.g., 2010
year_text CHAR(4) COMMENT 'lookup', -- e.g., "2010"
year_2_digits_num TINYINT, -- e.g., 10
year_2_digits_text CHAR(2) COMMENT 'lookup', -- e.g., "10"
holiday CHAR(1), -- "Y" or "N"
national_holiday CHAR(1), -- "Y" or "N"
religious_holiday CHAR(1), -- "Y" or "N"
state_holiday CHAR(1), -- "Y" or "N"
weekend CHAR(1), -- "Y" or "N"
3_day_weekend CHAR(1), -- "Y" or "N"
4_day_weekend CHAR(1) -- "Y" or "N"
-- etc.
) ENGINE=BRIGHTHOUSE
;
And these are just calendar attributes. Fiscal period attributes, business calendar attributes, marketing calendar attributes, etc. all exist and cannot be determined by SQL functions. You must "bake" them into the time dimension. Once. You can create any number of years' worth of time dimension values with data generation tools like those available from vizubi.com or datamelt.com or with a variety of functions in your spreadsheet of choice, such as Microsoft Excel, OpenOffice.org Calc,Lotus 1-2-3. These are just some of the many options avaiable.
With this approach you can return whatever value, string or concatenation you need from the time dimension without adversely affecting performance for the database server or the query/reporting platform.