Joinutility seperatorLogin utility separator Infobright.com

Infobright Blog

11
Feb

Approximate Querying (2)

Dominik Slezak's photo
by Dominik Slezak     Wed, Feb 11, 2009

One of my favorite bands titled their “best of” album “forever delayed”. (Guess who!) It’s close to my heart as I’m forever delayed with everything: papers, projects, blogs...

Queries are forever delayed too. It’s always better to have an answer sooner than later. An experienced user won’t be surprised with more complex queries taking more time. In some situations, however, the answer is needed “right now”, or at least not too late.

This is actually one more reason for considering approximate querying. Recently, a paper about it was referenced on our forums. In the paper, the authors propose a framework for time-constrained SQL, wherein a user issues a query as before but additionally provides nature of constraint (soft or hard), an upper bound for query processing time, and acceptable nature of results (partial or approximate). As I wrote before, all these aspects may be redesigned for Infobright, with our ROUGH QUERY (see one more forum thread) as an initial stage of approximation, and with the time constraints expressed by the maximum number of data packs (dynamically chosen by Knowledge Grid during execution) that can be additionally decompressed and processed in order to make the rough results crisper.

We also intend to investigate another approach, perhaps too crazy but also too tempting to give it up. It’s about making our Knowledge Grid “even more rough”. When you read, e.g., our VLDB paper, you may notice how much depends on min/max values computed for particular data packs and stored in the Knowledge Grid. So, imagine we have an integer column and a data pack with min value 100 and max value 500. Imagine 99% of values in this pack are between 200 and 250. So how about cheating a little bit and putting 200/250 instead of 100/500 into the Knowledge Grid? If we do it, the considered pack will be accessed less frequently and the results of queries will be “almost” the same. If we do it for more data packs, the average speed of queries may increase significantly. Certainly, it can be considered also for other types of knowledge nodes. Equally certainly, there are quite a few additional challenges here:

Firstly, it’s an interesting task how to create such inexact knowledge nodes intelligently. One may consider a number of optimization heuristics, which – given a data pack as an input – will “improve” its statistics in order to minimize the frequency of decompression and, in the same time, to minimize the average degree of imprecision in the results of queries, given the pack is not accessed. If you still remember my blog about data mining, you may consider it as one more area of using the algorithms extracting knowledge from data in Infobright.

Secondly, it’s crucial to estimate errors occurring at the level of particular data packs and to propagate them through the whole query execution process in order to provide the users with enough information about the overall expected errors in the results of queries. This is how we could employ the whole idea to achieve a practically useful framework.

Luckily, the two above challenges can be addressed independently. I realize that a good mechanism for the error propagation will require additions to the core engine. Given Infobright’s product roadmap, it may take us a long time before we can tackle it. On the other hand, however, the algorithms producing inexact knowledge nodes can be kept more “outside” the core architecture. It should be relatively easy for an experienced programmer to plug such algorithms into the code responsible for creating knowledge nodes during the data load. It’s also relatively easy to “convince” the Infobright engine to use new knowledge nodes instead of the old ones. Hence, at least at the very first stage of research, the engine may use new knowledge nodes not even realizing they’re not as accurate as they used to be.

It provides the basis for interesting cooperation between the Community and Infobright team with regards to investigating the most efficient knowledge structures aiming at approximate querying. I hope that some of you will be interested in such cooperation!

It looks like the right step into the future...

Best greetings,

Dominik

Infobright     Tags:
Please login or register to post a comment.