
Infobright Blog

Hadoop in the ETL Process

by CarlGelbart     Thu, Apr 29, 2010

Hadoop is a framework for processing large volumes of data (over 1TB) in parallel across a cluster of servers. If your data volumes overwhelm ETL tools or shell scripts, it may be for you. Typical generators of large amounts of raw data include call/message detail records for telecoms, and web logs for social networking sites, search engines and online retailers.

 

The idea is to have a very fast and scalable way to transform large amounts of raw data into manageable, summarized data that can be loaded into an analytical database like Infobright, where ad-hoc data mining and reporting can take place.

 

Hadoop processes data using MapReduce, which works in two steps. In the “Map” step, the master node takes the data, chops it up into smaller sets and distributes them to worker nodes. The worker nodes process the smaller sets of data and pass their results back to the master node. In the “Reduce” step, the master node combines the output of the worker nodes to form the final output that’s ready to load into the database.
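The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop code: the chunks are processed one after another in a single process rather than on separate worker nodes, and the record layout and the function names (`split_into_chunks`, `map_chunk`, `reduce_partials`) are made up for the example.

```python
from collections import defaultdict

def split_into_chunks(records, n_chunks):
    """Master-side step: chop the input into roughly equal smaller sets."""
    return [records[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Worker-side step: total call seconds per caller for one chunk."""
    partial = defaultdict(int)
    for caller, seconds in chunk:
        partial[caller] += seconds
    return dict(partial)

def reduce_partials(partials):
    """Combine the workers' partial totals into the final summary."""
    totals = defaultdict(int)
    for partial in partials:
        for caller, seconds in partial.items():
            totals[caller] += seconds
    return dict(totals)

# Hypothetical call detail records: (caller, duration in seconds).
records = [("alice", 60), ("bob", 30), ("alice", 15), ("carol", 45), ("bob", 10)]

chunks = split_into_chunks(records, n_chunks=2)
summary = reduce_partials(map_chunk(c) for c in chunks)
print(sorted(summary.items()))  # [('alice', 75), ('bob', 40), ('carol', 45)]
```

The summary is far smaller than the raw records, which is the point: it is the summarized output, not the raw data, that gets loaded into the database.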

 

An ETL chain will often require multiple map/reduce cycles. Joins are sometimes required, and it’s best if one of the files being joined can fit into RAM. So you may need a lot of RAM and a lot of worker nodes to be efficient and achieve the desired throughput.
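To see why it helps when one side of a join fits in memory, here is a small Python sketch of a two-cycle chain: the first cycle joins a large stream of rows against a small lookup table held in a dictionary, and the second cycle aggregates the joined rows. The data, the table names and the helper functions are hypothetical, and the code runs in one process rather than on a cluster.

```python
from collections import Counter

# Hypothetical small file: customer id -> region. Small enough to hold in RAM.
customers = {1: "EMEA", 2: "APAC", 3: "EMEA"}

# Hypothetical large file: (customer id, bytes transferred), streamed row by row.
web_log = [(1, 500), (2, 120), (1, 80), (3, 300), (4, 50)]

def join_in_memory(rows, lookup):
    """Cycle 1: attach the region to each row via an in-memory dictionary lookup."""
    for customer_id, nbytes in rows:
        region = lookup.get(customer_id)
        if region is not None:  # inner join: drop rows with no matching customer
            yield region, nbytes

def sum_by_key(pairs):
    """Cycle 2: aggregate the joined rows into per-region totals."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

bytes_by_region = sum_by_key(join_in_memory(web_log, customers))
print(bytes_by_region)  # {'EMEA': 880, 'APAC': 120}
```

Because the small table is a dictionary, each large-file row is joined with a single lookup and the large file never has to be sorted or held in memory. When neither file fits in RAM, the join needs a more expensive strategy.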

 

But MapReduce requires custom programming, which takes effort to develop, maintain and support. It is certainly not as easy as what many of us are used to with ETL tools, shell scripting and stored procedures.

 

One answer to this is Pig, an open source platform for analyzing large data sets using Pig Latin, a high-level, easy-to-use and extensible programming language. Pig Latin expresses a data analysis program in much the same way that SQL expresses a data analysis query. Often just a few Pig Latin statements are all that’s needed to process vast amounts of data. The Pig compiler generates a series of MapReduce programs for large-scale parallel processing.

 

Another tool that is used for ETL in the Hadoop framework is Hive. Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools that enable easy ETL and data summarization, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language called Hive QL.
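Since Hive QL is SQL-like, a summarization step reads much like ordinary SQL. The sketch below is not Hive: it uses Python's built-in `sqlite3` module as a stand-in so that it can run anywhere, and the `page_views` table and its rows are invented for the example. It only illustrates the kind of GROUP BY summary one would express in a SQL-like language instead of hand-written MapReduce code.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a table of raw web-log rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (site TEXT, visitor TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("shop", "u1", 500), ("shop", "u2", 300), ("blog", "u1", 120), ("shop", "u1", 80)],
)

# Summarize the raw rows: one output row per site.
rows = conn.execute(
    """
    SELECT site, COUNT(*) AS views, SUM(bytes) AS total_bytes
    FROM page_views
    GROUP BY site
    ORDER BY site
    """
).fetchall()
conn.close()

print(rows)  # [('blog', 1, 120), ('shop', 3, 880)]
```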

 

REFERENCES:

 

Hadoop - http://hadoop.apache.org/

Pig - http://hadoop.apache.org/pig/

Hive - http://hadoop.apache.org/hive/

 

 

 



Infobright     Tags: etl, hadoop

SELECT TOP 100 tunes FROM ever;

by David Lutz     Tue, Apr 27, 2010

[spoiler alert]  This post has nothing to do with database technology. [end spoiler alert]


For years I collected tens of thousands of songs requiring over a terabyte of disk space, not including backups.  And I used every conceivable music manager and player on Microsoft's Windows platforms - except iTunes.  But when I got my first MacBook Pro a few years ago I decided to finally give iTunes a try.  I can't believe what I was missing!

Besides all the other cool things that iTunes does - Cover Flow, automatic acquisition of album artwork and other details - I especially enjoy using what I call its "subset" features.  (Maybe this is a little bit database-y.)  You know, songs with similar beats per minute, same ranking, same genre.  But most importantly, Playlists.

And that brings me to both my main topic and a request for all of you.

I decided to create a Top 10 All Time Best Songs Ever playlist.  I should have known before I even added track #11 that this was a ridiculously small number for all recorded music spanning all styles, countries, and time.  So it quickly grew to a Top 50 list just by adding my 5-Star rated songs.  And now I've decided to make it a Top 100 list, with your help.

OK, so I live in Texas and have a natural familiarity with popular music in North America.  But I have also lived in Europe, Central America and South America, so I have some exposure to, and an appreciation for, music from all over the world.  So here (hear?) are some example tracks from my list.

99 Luftballons, Nena

Bang on the Drum All Day, Todd Rundgren

Cheap Sunglasses, ZZ Top

De Noite Na Cama, Marisa Monte

El Sol No Regresa, La 5ª  Estación

Fins, Jimmy Buffett

(I'm Gonna Be) 500 Miles, The Proclaimers

King of the Road, Roger Miller

Piel Canela, Natalia Lafourcade (y La Forquetina)

Senses Working Overtime, XTC

I'm looking to the Infobright Community to help me flesh out the remaining 50 songs on my list.  I also have a preference for live performances.  Extra points if you can help me locate an electronic recording!

Thanks in advance!

Infobright     Tags:
