Infobright Blog

Rough Data Contest

by Dominik Slezak, Sat, Feb 21, 2009

One may say that the Infobright Knowledge Grid is a new kind of data.

Imagine a data table with 1 billion rows. In ICE, it corresponds to 15259 rough rows. Each rough row groups together 65536 rows (the last group is smaller). Physically, each rough row is split into data packs corresponding to particular attributes, which ensures both horizontal and vertical data decomposition. Logically, each rough row can be treated as a row in a new rough table, where the attributes’ values correspond to the data pack statistics. Compared to the original data, the rough table has about 65536 times fewer rows and the same number of attributes, but its values are compound. For example, given 10 numeric attributes in the original data, we’ll now have 10 rough attributes, where every rough attribute labels every rough row with an interval value (the min/max values within the corresponding data pack). This interval can be further extended by, e.g., a binary histogram that encodes the holes in the min/max range.
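To make the arithmetic concrete, here is a minimal Python sketch of how such a rough table could be built from plain in-memory columns. It is only an illustration of the idea, not how ICE stores data packs: the function names and the dictionary layout are my own assumptions, and only the min/max statistics are computed (no histograms).

```python
PACK_ROWS = 65536  # rows grouped into one rough row, as stated in the post


def rough_row_count(n_rows, pack_rows=PACK_ROWS):
    """Number of rough rows needed for n_rows (the last one may be smaller)."""
    return -(-n_rows // pack_rows)  # integer ceiling division


def build_rough_table(columns, pack_rows=PACK_ROWS):
    """Build a rough table from a dict {attribute name: list of numbers}.

    Hypothetical helper: every rough row maps each attribute to the
    (min, max) interval of the corresponding data pack.
    Assumes at least one column and equal column lengths.
    """
    n_rows = len(next(iter(columns.values())))
    rough_table = []
    for start in range(0, n_rows, pack_rows):
        rough_row = {}
        for name, values in columns.items():
            pack = values[start:start + pack_rows]  # one data pack
            rough_row[name] = (min(pack), max(pack))
        rough_table.append(rough_row)
    return rough_table


print(rough_row_count(1_000_000_000))  # 15259
```

With 1 billion rows, the first 15258 rough rows are full and the last one groups the remaining 51712 rows.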

One may say that Infobright analyzes such rough tables while optimizing and executing queries. (Here we talk about standard, precise queries. Approximate queries are still in the future.) One may say that “the third face of data mining” described in one of the previous posts is about adding more types of rough attributes to rough tables. Last but not least, one may say that rough tables can be useful not only in database scenarios. How about, e.g., rewriting some data mining or visualization algorithms to work on rough tables instead of the original data? How much speed would be gained? What about the quality of the data mining results and the precision of the data visualization? Would it eventually be possible to integrate the rough and exact computation levels, as we did in Infobright?
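As a toy illustration of combining the rough and exact levels, the Python sketch below counts the values greater than a threshold. It first looks only at per-pack (min, max, row count) statistics and opens a pack only when the statistics cannot decide. This is a simplified sketch under my own assumptions, not the actual Infobright optimizer; the function names and the three-way split into skipped, fully counted, and scanned packs are chosen for this example.

```python
def pack_stats(values, pack_rows=65536):
    """Rough level: one (min, max, row_count) triple per data pack."""
    stats = []
    for start in range(0, len(values), pack_rows):
        pack = values[start:start + pack_rows]
        stats.append((min(pack), max(pack), len(pack)))
    return stats


def count_greater(values, stats, threshold, pack_rows=65536):
    """Count values > threshold, touching the exact data only when needed.

    Returns (count, packs_opened).
    """
    total = 0
    packs_opened = 0
    for i, (lo, hi, n) in enumerate(stats):
        if hi <= threshold:
            continue            # no row can qualify: skip the pack
        if lo > threshold:
            total += n          # every row qualifies: use the row count only
            continue
        # The interval straddles the threshold: fall back to the exact level.
        packs_opened += 1
        start = i * pack_rows
        total += sum(1 for v in values[start:start + pack_rows] if v > threshold)
    return total, packs_opened
```

For example, with the values 0 to 9 split into packs of 4 rows and a threshold of 5, only the middle pack has to be scanned: the first pack is skipped and the last one is counted from its statistics alone.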

Certainly, these questions are too difficult to be answered by a single person or a single research group. Hence, we started considering one more academic contest this year. I have a feeling that the best venue for a rough data contest would be a rough set event. I’m in touch with the organizers of the international rough set conference in Delhi, December 16-18. I’ll know more details pretty soon. Actually, we’ve already created a nicely formatted rough table for one of our favorite 1-billion-row data sets. I’m sure that discussing the results of such rough data analysis together will be greatly inspiring for everyone!

Best greetings,

Dominik
