Joinutility seperatorLogin utility separator Infobright.com
   
1 of 2
1
Approximate Querying
Posted: 22 January 2009 10:17 PM   Ignore ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello,

Isn’t it just a matter of time when database users will have to accept approximate queries?

I open this thread in relation to the recent blog:

http://www.infobright.org/Blog/Entry/approximate_querying/

I’d be grateful for everyone’s opinion!

Best greetings,

Dominik

[ Edited: 22 January 2009 10:58 PM by Dominik Slezak]
Signature 
Profile
 
Posted: 26 January 2009 09:40 AM   Ignore ]   [ # 1 ]  
Newbie
Avatar
Rank
Total Posts:  2
Joined  2009-01-23

I agree. The users do not need exact answers, but they do not know this yet.

It is not wise to insist on exact answers, when the data used is not exact. In a data warehouse one has a picture of the data from the point in time when the data were loaded. So the exact data is never available for analytical querying. What is the use of exact answers based on inexact data? None.

Furthermore, one must be aware that inexact data can be collected much faster. Note, the public opinion polls in Poland where 1000 opinion out of 30 million allow achieving good approximation, while 2000 people’s answer gives a very good estimate (as professional pollers admit).

An interesting paper on limiting the query run-time was published by VLDB: http://www.vldb.org/conf/2007/papers/industrial/p1207-hu.pdf.  It could be an interesting starting point of discussion on inexact answers.

Profile
 
Posted: 26 January 2009 01:30 PM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  721
Joined  2008-08-18

Definitely!

Consider the number of database, query and reporting technologies that support some form of data sampling.  With appropriate parameters for confidence levels and margins of error, these options are often accepted with the acknowledgment that the questions that are often asked of many data warehouses and analytical databases can be sufficiently answered with approximate answers.

Heuristically, it’s just a starting point.  Drill-down to exact answers in the details is then guided by the (rapid) approximate responses.

Signature 
Profile
 
Posted: 26 January 2009 01:55 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello Krzysztof and David,

First of all, Krzysztof, thanks a lot for the link! When I studied VLDB 2007 materials, I didn’t notice this particular paper. I’ll read it carefully and I’ll follow up with some comments soon.

I’m very glad that there is some interest in approximate querying. The sampling example is a good case study. It’s interesting how to design a good sampling strategy for (ad hoc) queries with diversified conditions, aggregations, et cetera. It’s further interesting, David, what you’re writing about user-database interactions and how it may relate to manipulations with sampling.

I hope that some of the ideas how to use knowledge grid with this respect will turn out as helpful. Now I’m feeling additionally motivated to write it down and present for further discussion.

With best greetings and thanks again,

Dominik

Signature 
Profile
 
Posted: 30 January 2009 05:20 AM   Ignore ]   [ # 4 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

There is also a relevant discussion in another thread:

http://www.infobright.org/forums/viewthread/476/

Best greetings,

Dominik

Signature 
Profile
 
Posted: 02 February 2009 01:37 AM   Ignore ]   [ # 5 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

Last Friday I attended a conference at MIT.  One of keynote speakers, Alon Halevy, heads the “Deep Web” search initiative at Google. The Shallow Web consists of all Web content (about 5 million pages) that you can search with Yahoo or Google. The Deep Web is everything else, including data behind the forms; it is estimated to be 500 times the size of Shallow Web.  Deep Web search provides a set of emerging business use cases (or vertical applications) for approximate query.

Alon Halevy listed several analytical database application challenges:  schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, data level synonyms discovery. 

Semantically and algorithmically, the rough set theory appears to match this application domain perfectly.  The biggest challenge is to come up with ways to formulate such queries without SQL language extensions.

Signature 
Profile
 
Posted: 02 February 2009 07:49 AM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello Alex,

Thanks for joining the discussion. It’s great that you could listen to Alon Halevy. I’ve never met him although I could read a couple of papers he co-authored. He must a be a very good speaker with plenty of brilliant ideas. I’m very glad that he has referred, among the others, to this particular topic. I’ll read your blog about New England Database Day at MIT with a huge interest. It must have been a very inspiring event!

Regarding your comments:

Alex Esterkin - 02 February 2009 01:37 AM

The rough set theory appears to match this application domain perfectly. The biggest challenge is to come up with ways to formulate such queries without SQL language extensions.

Indeed, as discussed in the rough set thread (http://www.infobright.org/Forums/viewthread/113/), one of the original goals while introducing the rough set theory more than 25 years ago was to provide an efficient methodology for dealing with approximations computed from data. The way we adapted the fundamental rough set principles at Infobright enables us to take an advantage of this. On the other hand, as you noted, there is also a challenge at the level of SQL syntax. As partially discussed in yet another thread (http://www.infobright.org/forums/viewthread/476/), SQL extensions seem to be more or less natural in various specific cases. So I guess we should look at such specific cases instead of attempting to seek for a complete solution (which may be hard and practically not necessary). Certainly, the scenarios where no extensions are needed would be the best ones. On the other hand, we should also remember about extending the format of the results of SQL queries, which seems to be a separate task…

Best greetings,

Dominik

Signature 
Profile
 
Posted: 02 February 2009 08:38 AM   Ignore ]   [ # 7 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

Hello, Dominik,

Yes, indeed, extending query result format is another aspect.  Perhaps, rough query applications related to “Deep Web” search or semantic search can be implemented without query result format extensions; however, as you point out, in case of approximate converging query, such extensions appear almost unavoidable.

Another interesting question is whether the traditional single round trip request-response query execution protocol is sufficient for rough query applications of particular type. 

In case additional intra-execution user input is not required, the query can be simply intercepted by the database server, interpreted in a “rough query” way, and then executed without the constraints of traditional single-roundtrip request-response.

Otherwise, even the client-server protocol may have to be extended.  The good news is that client server protocol extensions are likely to emerge anyway.  For example, the Drizzle community is working on such extensions. Perhaps, some of the already emerging protocol extensions match our needs.


Best greetings,

Signature 
Profile
 
Posted: 19 February 2009 10:13 AM   Ignore ]   [ # 8 ]  
Member
RankRankRank
Total Posts:  230
Joined  2008-12-03

Hi Dominik,

I was reading your discusion about approximate queries and I wanted to bring to your atention my comments in thread
Problem executing a Query
Please take look and see if this make sense to you.
thanks
C_George

Profile
 
Posted: 19 February 2009 11:59 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello C_George:

C_George - 19 February 2009 10:13 AM

I was reading your discusion about approximate queries and I wanted to bring to your atention my comments in thread
Problem executing a Query
Please take look and see if this make sense to you.

Thanks a lot for bringing this topic here!

Actually, I was reading the discussion in the Problem executing a Query thread (http://www.infobright.org/Forums/viewthread/454/#2051), wondering how to respond.

Indeed, it’s possible to link together two potential applications of rough/approximate query:

1. External usage—extending SQL syntax with rough/approximate query-related functionality

2. Internal usage—using rough/approximate query mechanism to speed up the standard queries

One can think about our knowledge grid query optimization/execution mechanisms as based on rough/approximate query. Certainly, when designed for internal usage, it doesn’t need to be so sophisticated with regards to, e.g., the query syntax. Nevertheless, these are straightforward examples where internal rough queries are used, with no need of accessing data:

1. Splitting data onto relevant / suspect / irrelevant

2. Cardinality and selectivity estimation

3. Using it while optimizing the ordering of joins etc.

4. Optimizing the ordering of packs to be processed

5. [...]

Although—as I’ve already written—there’s no need of so clearly and completely defined syntax at such an internal level of application, I often do a kind of exercise and try to express such examples within pretty informal ROUGH QUERY framework, where the rows are replaced by packrows and the rows’ values are replaced by the knowledge grid statistics.

It should be additionally mentioned that such internal rough queries are used by the engine not only at the beginning of query optimization but also during the query execution, basing on some intermediate results obtained while ongoing data processing.

We’ll keep developing the new ways of using such rough methods and rough queries in query optimization and execution.

There’s still a lot to be done!

Best greetings and thanks again,

Dominik

Signature 
Profile
 
Posted: 21 February 2009 07:07 PM   Ignore ]   [ # 10 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-01-29

I appended my most recent comments from closely related thread “IB Specific SQL Syntax” http://www.infobright.org/forums/viewthread/476/ concerning leveraging rough queries along with “exact queries.


——-

Dominik et al,

I have been focused on other commitments and did not have chance to return to the discussion of rough queries. FYI - Some of the most interesting challenges that I confront are the definition and counting of distinct entities - i.e. customers and visits - as well as related entities - i.e. similar customers and similar visits.  Accordingly I would like to investigate using rough queries to facilitate novel analytics applications.

As you outlined, the extension of IB to rough queries would need to consider the following and I provided my suggestions:

1. Extending SQL syntax:  As with other RDBMS catalogs, one approach is to allow querying of the IB statistics from the KN grid. Alternatively, in line with the paper, introduce a new IB-specific keyword “rough” as optional parameter within select statement syntaz - e.g. SELECT [ROUGH] -that allows only subset of SQL aggregations available in the KN - i.e. Sum, Max, Min - with the parameter having local context so that correlated subqueries are considered exact queries by default. Either approach will allow users to combine rough queries with exact queries, opening many potential avenues using breadth of SQL syntax to leverage IB’s KN statistics.

2. Extending the structure of query results:  For now retain the structure of query results

3. Extending the engine capabilities to take the full advantage of knowledge grid in order to execute rough queries: Keeping with outside-in thinking, remain close to the community and identify common patterns of SQL-based rough querying that could be automated or refined via further enhancements to IB-specific SQL extensions e.g. rough variants of mathematical functions.

I did also read the other thread on approximate queries you referenced- http://www.infobright.org/Forums/viewthread/454/. I believe these suggestions align closely to what was described in the that thread as well.

Best regards,

Profile
 
Posted: 21 February 2009 08:28 PM   Ignore ]   [ # 11 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-01-29

A few comments illuminating the potential applications for applying sampling techniques - whether random or statistical techniques:

1) Sampling enables a broader spectrum of business needs within the arena of testing strategy and risk minimization beyond those of analytic performance and responsiveness. For example, several of business issues necessitating sampling include clinical trials, advert testing, comparative testing - all of which incorporate test vs control.  Thus, the functionality for sampling (whether integral to one component or integrated across the technology tier) ideally must address partitioning the selected sample further into N subgroups not only 2-cell (most common) but also 3-cell, 4-cell, etc. 

2) In context of sampling customers, we might think that a set of datapacks could represent the set of sampled customers. For instance, when the customer data set is organized as required for MPP - i.e. partitioned by customer id - and transaction row_count per customer follows typical frequency patterns - then all the transactions for each customer would reside in much less than FFFF rows of a single datapack. The transactions of first and last customers found in one datapack may be continued from the previous datapack or continue into next data pack. However, the proportion of each data pack required per average customer would depend on the average customer transaction row_count. In order to sample from customers within the datapack, we might elect to omit the first and last customers - identified by the MIN(customer id) and Max(customer id). More robustly, when datapack N is selected as part of a sample, we might instead also select the datapacks N-1 and N+1 corresponding to datapack N. Thus, we have two alternative mathematical approaches to sampling customers using datapacks.


In order to enable sampling on customer ids or otherwise, we would need access to enumerated datapacks - i.e. DP num - analogous to enumeration of rows in RDBMS - i.e. row num. Thus, we might apply the sampling constraint defined by list of numbered datapacks as part of the relevancy critieria for datapacks. As such, I would expect disk I/O if sampling X% of datapacks would be approx. linear with disk I/O if not sampling datapacks.


Best regards,

[ Edited: 21 February 2009 08:39 PM by BInnovator]
Profile
 
Posted: 22 February 2009 05:02 AM   Ignore ]   [ # 12 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello BInnovator,

Thanks a lot for your very thoughtful comments! I’m not kidding - you really helped me. Your comments show how the whole idea could be introduced more as a sequence of steps discussed interactively with the community rather than turning the code upside down without any guarantee that the outcome will match expectations.

BInnovator - 21 February 2009 07:07 PM

Extending SQL syntax:  As with other RDBMS catalogs, one approach is to allow querying of the IB statistics from the KN grid.

This is something we should do sooner than later. It may provide a lot of value! It doesn’t interfere with any core mechanisms that would make implementation more difficult. Actually, it looks like related also to the blog entry about rough data analysis:

http://www.infobright.org/Blog/Entry/rough_data_contest/

I believe that there is a huge potential in analyzing the Knowledge Grid, either by SQL or data mining algorithms. It may need further discussion how to formulate queries and interpret results. Starting with the min/max KNs, it may be interesting to learn together how to query against more advanced structures. It may be inspiring also for Infobright! - The KN querying strategies discussed with the community may lead us to some new hints how to optimize and execute the exact queries inside our engine.

BInnovator - 21 February 2009 07:07 PM

Alternatively, in line with the paper, introduce a new IB-specific keyword “rough” as optional parameter within select statement syntax - e.g. SELECT [ROUGH] - that allows only subset of SQL aggregations available in the KN - i.e. Sum, Max, Min (...).

Indeed, we don’t need to implement the “rough modifier” for all the aspects of SQL syntax at once! By doing it step by step, we’d be able to focus on the most practically useful areas. Surprisingly (or not surprisingly at all), such areas may turn out the easiest ones for internal implementation. Certainly, more design and development effort will be needed here. Nevertheless, such a step-by-step approach may allow us for introducing some rough elements far sooner into the future technological roadmap.

Perhaps it’s a good moment to refer to one more blog entry, especially its 4th paragraph:

http://www.infobright.org/Blog/Entry/approximate_querying_2/

I’m looking at “inexact KNs” as one more alternative. Just like in the case of some of your above ideas, such an extension does not require changes in the core modules of query optimization and execution. Hence, it should be relatively easy to proceed with experiments and - if the results are promising - to introduce some new elements to our future roadmap. Also, there is a chance that we’ll be able to prepare the ICE code well enough to let the community contribute with such new KN structures. I will surely write more about it soon.

BInnovator - 21 February 2009 07:07 PM

Extending the structure of query results:  For now retain the structure of query results

If the community could accept such a solution, I would be very happy. I think this is the only reasonable approach for now.

BInnovator - 21 February 2009 07:07 PM

Extending the engine capabilities to take the full advantage of knowledge grid in order to execute rough queries: Keeping with outside-in thinking, remain close to the community and identify common patterns of SQL-based rough querying that could be automated or refined via further enhancements to IB-specific SQL extensions e.g. rough variants of mathematical functions.

It’s Sunday morning in Warsaw and I’m slowly running out of time. I’ll continue replying to your ideas hopefully later today. Rough variants of mathematical functions is one of the very hot topics at Infobright now. Also, I need to think more about sampling…

Best greetings and thanks one more time,

Dominik

Signature 
Profile
 
Posted: 22 February 2009 07:26 AM   Ignore ]   [ # 13 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  602
Joined  2008-08-18

Hi,

Some loose (rough?) thoughts.

Speaking about rough queries, it is always a problem with developing a versalite and useful semantics of th query result. Sometimes it is quite obvious what we may mean as a result of rough version of query:

ROUGH SELECT count(*) FROM t1 WHERE a>12;

+-----------------+---------------+
min(count(*))   | max(count(*)) | 
+-----------------+---------------+
3503            12714         |
+-----------------+---------------+ 

We may interpret it as the best approximation of lower and upper bound of the result. But how to define a rough result of join? Grouping? “Select *”?

I think the only way to maintain versality, “pluggability” and scalability of rough querying, from a perspective of new KNs, new algorithmic ideas, new user needs etc. is to outline a basic “language” of rough parameters, and to present the query result in a “parameter - value” manner:

ROUGH SELECT amin(bFROM t1 join t2 on t1.key t2.key WHERE a>12 AND c<28 GROUP BY a;

+--------------------------+--------------+
rough_param              rough_value  
+--------------------------+--------------+
min_num_of_result_rows   6            |
max_num_of_result_rows   325          |
min_value_of_col_1       13           |
max_value_of_col_1       1295         |
min_value_of_col_2       0            |
max_value_of_col_2       139913       |
min_num_of_pack_opened   340          |
max_num_of_pack_opened   340          |
min_cardinality_of_group 1            |
max_cardinality_of_group 7500         |
+--------------------------+--------------+ 

Such result may be analyzed automatically by external applications. If a rough result is not applicable to a query of some type, it will be just omitted in the result. If someone would like to implement e.g. an average count(*), or a most probable result, or even a set of values which may occur in the result - then the mechanism of presenting results will be ready.

Regards,

Signature 
Profile
 
Posted: 22 February 2009 01:35 PM   Ignore ]   [ # 14 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

I think, the best way to narrow down the semantics of rough querying is to make a short list of fundamental use cases.  These use cased have to come from real application needs.

For exampe

Real Application Need #1.

Form fitting using at most N attributes.  As you increase N, the degree of roughness decreases.  This kind of rough query situation may be applied to event processing, satellite surveillance, event processing, fraud detection, signal processing.


Real Application Need #2.

Deep Web Search (the server side of “semantic web”).  According to Dr. Alon Halevy, who heads the “Deep Web” search initiative at Google,  there are the following approximate data processing use cases:  schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, data level synonyms discovery.

Hypothetical Application Need #3.

Iterative Querying.  Database query is always a single request-response roundtrip.  You can generalize the client server protocol to be allow for multiple roundtrips, each returning a rough result.

Note that MySQL provides a container for syntax extensions, allowing SELECT /*! some_code */ ... syntax go through.  Thus, you can have something like this:

SELECT /*! ROUGH(PERCENTILE=‘75%’, EXTRAPOLATION_FIELD=‘year’,EXTRAPOLATION_VALUE=‘6’) */  AVG(salary + bonus) FROM salarydata WHERE years_experience BETWEEN 5 AND 10 AND industry = ‘Finance’ AND job_title = ‘IT Manager’ AND metropolitan_area = ‘New York’;

[ Edited: 22 February 2009 02:03 PM by Alex Esterkin]
Signature 
Profile
 
Posted: 23 February 2009 07:50 AM   Ignore ]   [ # 15 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello All:

It’s a pleasure to reply to so many posts, coming from both community members and Infobright employees!

Alex Esterkin - 22 February 2009 01:35 PM

Iterative Querying.  Database query is always a single request-response roundtrip.  You can generalize the client server protocol to be allow for multiple roundtrips, each returning a rough result.

Alex, this is my favorite item on your list. I treat iterative querying as something between classical querying and machine learning. In classical querying, the user is fully responsible for formulating a good query. In classical machine learning, the user can hardly control the process. There’s a growing interest in iterative / interactive methods of data analysis. In (not only) my opinion, it is the future of data mining. It may be supported by, e.g., an iterative dialog between the user and the database. At the very beginning of the dialog, the user should be allowed to quickly look around the data (fast approximate querying). Then, gradually, the data investigation may become more precise…

Jakub Wroblewski - 22 February 2009 07:26 AM
+--------------------------+--------------+
rough_param              rough_value  
+--------------------------+--------------+
min_num_of_result_rows   6            |
max_num_of_result_rows   325          |
min_value_of_col_1       13           |
max_value_of_col_1       1295         |
min_value_of_col_2       0            |
max_value_of_col_2       139913       |
min_num_of_pack_opened   340          |
max_num_of_pack_opened   340          |
min_cardinality_of_group 1            |
max_cardinality_of_group 7500         |
+--------------------------+--------------+ 

I’m not sure. I’d rather keep the result format as if it was a classical case. How about answering with some expected values and reporting “rough variance” as a separate outcome? Such an approach might be easier for integrating with external tools. On the other hand, we might introduce Infobright specific functions ROUGH_MIN and ROUGH_MAX that could be used as follows:

SELECT aROUGH_MIN(min(b)), ROUGH_MAX(min(b)) ... 

I believe that such an approach would be consistent both with the classical query/result syntax and with the above suggestion.

Jakub, continuing with your ideas, there are of course multiple aspects of pluggability of new KNs including, e.g., their creating, storing and using by the engine. In the case of inexact KNs that I mentioned in my last forum/blog posts, I’d add an extra piece of functionality that would enable KNs to inform the engine about their degree of accuracy. I’m fully open for further discussion!

(For those who don’t know, Jakub and I sit in the same room. But we’ll surely keep using this thread for approximate ideas!)

BInnovator - 21 February 2009 07:07 PM

Extending the engine capabilities to take the full advantage of knowledge grid in order to execute rough queries: Keeping with outside-in thinking, remain close to the community and identify common patterns of SQL-based rough querying that could be automated or refined via further enhancements to IB-specific SQL extensions e.g. rough variants of mathematical functions.

As I wrote yesterday, rough variants of mathematical functions are now a popular topic at Infobright. Not so long time ago, the queries including functions were switching to MySQL. Now it’s not the case any more. Moreover, we keep re-architecturing the engine to let the functions’ outcomes be processed as if they were standard columns. One of the most important undertakings ahead of us is to extend such uniformity of processing onto the Knowledge Grid. In orider to do this, we need to come up with a framework enabling us to compute KNs of the functions’ values from KNs of the functions’ arguments. It is needed for efficient execution of both approximate and classical queries. I honestly hope that we’ll come up with an architecture enabling external contributions. Otherwise, we may spend quite a lot of time implementing rough versions of all important functions. grin

BInnovator, I still need to think more about your comments about rough sampling. I believe that, thanks to the Knowledge Grid, we might come up with a better shape of the “accuracy curve” than in the case of classical sampling. I mean that, in our case, taking the same percentage of data would provide more accurate results. For now, however, it’s not proved yet. Your comments may be very useful with this respect.

Thanks!!!!!

Dominik

Signature 
Profile
 
   
1 of 2
1