Joinutility seperatorLogin utility separator Infobright.com
   
 
Information Schema For Knowledge Grid
Posted: 02 March 2009 04:35 AM   Ignore ]  
Newbie
Rank
Total Posts:  1
Joined  2009-01-22

Hi,

It would be great to have an ‘information_schema’ type set of tables for the knowledge grid so that we could easily view the status of histograms, character maps and pack to pack knowledge nodes.

Thanks!

Matthew Wooller
Sulake Corporation

Profile
 
Posted: 02 March 2009 06:49 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Matthew,

This is a very good suggestion. One of our ongoing objectives is to make knowledge grid more “visible”. As an example, we have some commands showing usefulness of knowledge nodes in queries. (I’m sure one of my colleagues will follow up this topic.) As another example, there was recently discussion on querying the knowledge grid in a kind of catalog style (see the first out of two BInnovator’s posts on Feb 21 at http://www.infobright.org/index.php/Forums/viewthread/454/P0/). I think that your suggestion to improve information about the status of knowledge nodes (e.g., which of them are created and to what extent they are updated) is one more important example in the “visible knowledge grid category”.

Thanks and best greetings,

Dominik

Signature 
Profile
 
Posted: 02 March 2009 10:49 AM   Ignore ]   [ # 2 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

Hello, Matthew,

Do you have any particular requirements in this respect?  What kind of details would you find useful? 

Another question. As you know, catalog schema tables contain several different kinds of data:

* Metadata created administratively or via DDL,  which describes database schema objects - tables, views, columns, constraints, tablespaces, etc.
* Metadata created administratively or via DDL, which describes data objects themselves - users, privileges, etc.
* Database information that is collected and compounded from time to time: data cardinality, column statistics, storage utilization, etc.
* Real-time information for database administrators - user sessions, locks, system utilization, internal database object state.

What kind of Knowledge Grid information would be useful for you from this angle?

Thank you so much for you input!

Best regards,

Signature 
Profile
 
Posted: 02 March 2009 11:21 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

There are several tricks to take a look inside KNs. A roughstats command may be used as an alias of the first column:

SELECT COUNT(*) AS roughstats FROM t1

The query itself works as usual, but on the server console (or *.err log file) the statistics of the first used table will be displayed in the following form:

-------- RSIndices for trans (tab.no371,  17 packs) ----------
Name               Triv.packs Span    Hist.dens.
----------------------------------------------------------------
trans_id            0         99.99%  73.31%
account_id          0         99.94%  69.78%
tdate               0         5.88%   14.19%
(...)
bank                0         94.11%  100%
account             0
-------------------------------------------------------------------
RSI_PPcolaccount_id  joined with tab365col0    dens100%
RSI_PPcolaccount_id  joined with tab369col2    dens100%
RSI_PPcoltdate    self-joined with coltdate     dens100%
RSI_PPcolamount   joined with tab373col2    dens100%
------------------------------------------------------------------- 

Note that to see these statistics, “ControlMessages = 2” must be set in brighthouse.ini. The following information is displayed:

  * the name of the table and its number (used for identifying KN files),
  * the number of packs,
  * each line describes one column:
    —its name
    —number of trivial packs, i.e. containing nulls only, or the same numerical value for each row,
    —average span, i.e. “width” of the numerical pack (max. value minus min. value), comparing to the overall span of values in this column,
    —histogram/CMAP density (if one exists), i.e. percentage of 1’s in histograms.
  * The list of all Pack2Pack nodes, together with their density (% of ones), for any column from the table, joining with any other table/column. Note that there is a column name displayed only for self-joins (the table to itself).

Statistics of histograms and span are calculated for nontrivial packs only.
In general, small % values are better. It may be considered as a probability that a random value will not be excluded by min/max or other KN (histogram/CMAP).

The example above means that all 17 packs are nontrivial and each pack is included in an interval covering (in average) 5.8% of the whole domain of tdate and each such interval is additionally scattered such that only 14% of its values are actually present. Thus, if we have random condition “tdate = x”, where x is somewhere between minimal and maximal value of tdate, then we have 100% - (5.88% * 14.19%) = 99.1% chance that this condition will be resolved by either DPN or KN, without touching the data (for a particular pack). I.e. probably just one pack will be opened by such query.

Another method to display statistics is roughattrstats alias:

SELECT col AS roughattrstats FROM tab LIMIT 1

It displays information about the selected column, separately for each data pack:

-- examples from TPC-H 1 GB database

select l_orderkey 
as roughattrstats from lineitem limit 1;

Column 0table 71 
Pack       Rows  Nulls                  Min                  Max  Hist

-------------------------------------------------------------------------------
0         65536      0                    1                65380  1024/1024
1         65536      0                65381               130530  1024
/1024
...
90        65536      0              5896963              5962531  1024/1024
91        37439      0              5962531              6000000  1024
/1024
-------------------------------------------------------------------------------


select p_container as roughattrstats from part limit 1;

Column 6table 66 
Pack       Rows  Nulls                  Min                  Max  Hist

-------------------------------------------------------------------------------
0         65536      0                    0                   39  n/a
1         65536      0                    0                   39  n
/a
...
3          3392      0                    0                   39  n/a
------------------------------------------------------------------------------- 

Displayed information (numerical column) involve column and table internal number, pack number, number of rows in the pack, number of null values in the pack, minimal and maximal value (internally encoded), and histogram density. The density is displayed as ‘x/y’, where x is a number of ones in histogram (the value may occur in data pack), and y is a total number of histogram positions. Default value 1024 means that the interval for the data pack is wider than 1024, so histogram is approximate. Value ‘n/a’ means that the histogram is not available for some reason.

Another example:

select vin as roughattrstats from fact_sales limit 1;

Column 0table 140 
Pack       Rows  Nulls        Min        Max  CMap 
-------------------------------------------------------------------------------
0         65536      0   0AAHL653   8ZZYU286  9-26-26-26-26-10-10...
1         65536      0   0AAFF860   8ZZUM680  9-26-26-26-26-10-10...
...
15258     51712      0   0AABB747   8ZZHU603  9-26-26-26-26-10-10...
------------------------------------------------------------------------------- 

String columns are described by minimal and maximal value (8 first characters, which may define a common prefix), and CMap statistics. The latter is displayed as a series of numbers indicating a number of different characters on consecutive string positions (example above: only 9 character on the first position, probably digits ‘0’-‘8’). Up to 7 positions of CMap is displayed. If a common prefix is found, the CMap starts on the first character after prefix.

Regards,

Signature 
Profile
 
Posted: 02 March 2009 11:22 AM   Ignore ]   [ # 4 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

A small addition:

The roughattrstats switch takes into account also WHERE clause. It is possible to see how much rows were filtered out by WHERE condition for every data pack. Note that min/max/null/KN information is still global (for the full pack, not the filtered one).

The statistics may not be displayed if no rows is returned from query, or aggregations (GROUP BY, COUNT etc.) are in use. Thus the best form of query is:

SELECT col AS roughattrstats FROM tab1 [WHERE ...] LIMIT 1

LIMIT 1 will prevent unnecessary displaying too large result.

Regards,

Signature 
Profile
 
Posted: 02 March 2009 03:06 PM   Ignore ]   [ # 5 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

In case of rough attribute statistics, the question is whether it is kept in information schema tables or calculated for the given user request.

The information schema based option would mean that one or more tables are created in the information schema and that users and sysadmins will be able to query these tables as any other information schema tables:

SELECT ....... FROM roughattrstats WHERE table_name='ibtab1' and attr_name='col1'

For rough stats calculations performed for the given user request, another variation of Kuba’s idea is to have a built-in function, e.g., ROUGHATTRSTATS(col1).  In this case, the LIMIT clause will be unnecessary, as you can treat it as an aggregate function.

SELECT ROUGHATTRSTATS(col1FROM tab1 [WHERE ...
[ Edited: 02 March 2009 03:18 PM by Alex Esterkin]
Signature 
Profile
 
Posted: 03 March 2009 01:55 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  11
Joined  2009-01-29

Hi all,

As Alex suggests, it certainly would be beneficial to treat “roughattstats” as a statistics table such that the user can generate a result set by projecting columns and selecting rows with each referencing one or more of any of the statistics table’s columns: table, column, pack, rows, nulls, min, max, (as well as hist or cmap that might return null value for any table.column having incompatible data types).

By enabling user access to the min and max values of an attribute grouped by data pack, one logical progression would be to attempt to leverage the knowledge to winnow the result set by filtering pack #‘s based on the attribute’s values within each pack’s span. Several use case scenarios involving multi-level queries blending both rough and exact nature come to mind. I described one involving sampling under the posting “Approximate Querying”.

Thanks.

Profile