Joinutility seperatorLogin utility separator Infobright.com
   
 
Sort rows to increase compression ratio and query speed
Posted: 08 October 2008 03:55 AM   Ignore ]  
Newbie
Rank
Total Posts:  12
Joined  2008-10-02

If you want to increase compression ratio of data stored in ICE - which usually improves also query performance - try the following trick.

When loading data, sort rows by values from a chosen column. Then, ICE will be able to achieve better compression on this column. If originally the column was difficult to compress, the overall benefit of sorting may be very significant.

Note that you do not have to sort rows in the whole table, it is enough to sort every load separately (at least every ~100K consecutive rows should be sorted).

[ Edited: 08 October 2008 04:04 AM by Marcin Wojnarski]
Signature 

Marcin Wojnarski
Infobright Inc.

Profile
 
Posted: 08 October 2008 10:03 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  700
Joined  2008-08-18

Marcin,

We have used this technique on POCs and have suggested it to those performing evaluations and it has been very effective in improving performance of test queries.  Of course, performance is best affected when the primary filter for most queries is the same as the first order sort column on a given table.  As many data warehouse and analytical queries are constrained on the time dimension (or time field), we have found that this is the best, first column to evaluate for sorting prior to load.

When loading log files, point-of-sales records, and other event-based data, the data often comes in chronological order as it is captured in real time.  But when there are multiple sources of data from the same period, sorting them together prior to load provides the benefits you have pointed out.

Thanks for the suggestion and reminder!

Signature 
Profile
 
Posted: 08 October 2008 10:20 AM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

I think it works really well (the more so as it does not require global sorting, but sorting on each load locally) until we have two important columns and sorting with respect to each one of them “destroys” the other one. There are some vendors who would suggest, in such a case, keeping multiple copies of the data, each of them sorted with respect to a different column, but it does not seem to be a clean / flexible solution and it is harmful to the overall compression ratio.

Signature 
Profile
 
Posted: 08 October 2008 12:29 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  428
Joined  2008-08-18

It is most advantageous for Infobright to query against columns that are well-ordered. For facts (transactional data) that usually means time sequence.

Presorting data is part of what’s known as ETL (extract-transform-load), which is a large part of the work in setting up a data warehouse. There are some open source tools out there such as Talend, and Pentaho Data Integration (Kettle), which can be used… Check them out!

Geoffrey

Signature 
Profile
 
Posted: 19 February 2009 06:43 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  12
Joined  2009-02-10

To presort data and to do more generally ETL, I use Talend open studio http://www.talend.com/ .

Its a good application for IT and business people, with a great GUI and performance. What I like the most is the community and the debugging process. And the price as its an open source tool.

Profile
 
Posted: 23 February 2010 01:25 AM   Ignore ]   [ # 5 ]  
Newbie
Rank
Total Posts:  1
Joined  2010-02-23

I am using Adeptia ETL Suite (http://www.adeptia.com/products/data_transformation.html)  which having powerful data conversion capability and supports ANY TO ANY conversion. It analyze data for business intelligence requires information to be aggregated from multiple sources into a data warehouse.

Profile