Organizing data and more about rough data contest

by Dominik Slezak     Sun, May 03, 2009

Infobright's query performance depends on the quality of the Knowledge Grid. The definition of quality for particular Knowledge Nodes should reflect their role in query optimization and execution. Quality may be related to the average closeness of the minimum and maximum values in Data Pack Nodes, to the percentage of zeros in our Histograms, Character Maps and Pack-to-Pack Nodes, and so on.
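
One of these measures, the closeness of the minimum and maximum values, can be illustrated with a minimal sketch. The function names and the normalization by the column's global range are assumptions made here for illustration; they are not Infobright's actual definition of quality.

```python
from typing import List, Sequence, Tuple

PACK_ROWS = 2 ** 16  # rows per group, as stated in the post


def pack_min_max(values: Sequence[float],
                 pack_rows: int = PACK_ROWS) -> List[Tuple[float, float]]:
    """Return (min, max) for each consecutive group of pack_rows values."""
    return [
        (min(values[i:i + pack_rows]), max(values[i:i + pack_rows]))
        for i in range(0, len(values), pack_rows)
    ]


def avg_relative_width(values: Sequence[float],
                       pack_rows: int = PACK_ROWS) -> float:
    """Hypothetical quality score for one non-empty column.

    Mean (max - min) per pack, divided by the column's global range.
    Lower values mean tighter min/max statistics.
    """
    global_range = max(values) - min(values)
    if global_range == 0:
        return 0.0  # constant column: every pack is already as tight as possible
    nodes = pack_min_max(values, pack_rows)
    return sum(hi - lo for lo, hi in nodes) / (len(nodes) * global_range)
```

Under this score, a column whose packs each cover a narrow slice of the value range scores close to 0, while a column whose packs each span nearly the whole range scores close to 1.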

The quality of the Knowledge Grid depends on how the incoming data rows are organized into Data Packs; more precisely, on the policy of assigning particular rows to particular groups of 2^16 rows, which are then transformed into collections of compressed Data Packs. In the current implementation, the initial row ordering is not changed while loading data into Infobright. Of course, one can pre-order data prior to load, but this will not necessarily improve all Knowledge Nodes that are useful in a particular query workload. In fact, pre-ordering will improve only a fraction of Knowledge Nodes and make the others worse. Moreover, it will significantly slow down the data load. Hence, we have started implementing alternative algorithms for on-the-fly organization of rows that will improve the quality of the whole Knowledge Grid with no harm to the data load speed.
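
The pre-ordering trade-off can be made concrete with a small experiment. The following is a minimal sketch on synthetic data, not Infobright's loader: the two columns, the function names, and the small pack size (256 rows instead of 2^16, to keep the run fast) are all assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def avg_pack_width(values: Sequence[int], pack_rows: int) -> float:
    """Mean (max - min) over consecutive groups of pack_rows values."""
    widths = [
        max(values[i:i + pack_rows]) - min(values[i:i + pack_rows])
        for i in range(0, len(values), pack_rows)
    ]
    return sum(widths) / len(widths)


def preorder_demo(n_rows: int = 4096, pack_rows: int = 256,
                  seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """Compare per-pack min/max widths of two columns before and after sorting.

    Column a grows with load order (think of a timestamp); column b is random.
    Both the data and the pack size are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows: List[Tuple[int, int]] = [
        (i, rng.randrange(n_rows)) for i in range(n_rows)
    ]
    by_b = sorted(rows, key=lambda r: r[1])  # pre-order on column b only

    def widths(rs: List[Tuple[int, int]]) -> Tuple[float, float]:
        return (avg_pack_width([r[0] for r in rs], pack_rows),
                avg_pack_width([r[1] for r in rs], pack_rows))

    return {"load_order": widths(rows), "sorted_by_b": widths(by_b)}


if __name__ == "__main__":
    for name, (width_a, width_b) in preorder_demo().items():
        print(f"{name}: avg width a={width_a:.1f}, b={width_b:.1f}")
```

In this toy setting, sorting on b narrows the min/max ranges of b but widens those of a, because rows that were adjacent in load order end up scattered across packs.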

Let me skip the algorithmic details. There will be more about them at SIGMOD and/or VLDB this year. Instead, let me focus on the results. Consider the following example:

Some time ago we wrote about the Rough Data Contest, organized as part of the academic conferences to be held this December in Delhi. I posted the rough data to be analyzed by the contest participants, with rows corresponding to the 2^16-row groups taken from one of our benchmark datasets and columns corresponding to the min/max statistics stored in Infobright’s Data Pack Nodes. However, I realized that the majority of the information about dependencies between the columns of our benchmark dataset was completely lost. Therefore, I decided to create the contest data one more time, this time using the above-mentioned algorithm for more careful organization of rows during the data load.

The result is attached to the latest post on the Rough Data Contest forum thread. You can see that the number of non-trivial min/max values is now much higher. It turns out that organizing data for the purpose of improving Knowledge Nodes can also bring other positive effects. In particular, it takes me back to our presentation at VLDB 2008. One of the questions from the audience was: Aren’t you afraid of losing correlations between columns when grouping data into Data Packs? So the answer should be: Yes, there is such a risk, but we can attempt to cope with it by organizing rows more carefully during data load.

Best greetings,

Dominik
