Joinutility seperatorLogin utility separator Infobright.com

Navigation

 ·   Wiki Home
 ·   Wiki Help
 ·   Categories
 ·   Title List
 ·   Uncategorized Pages
 ·   Random Page
 ·   File Upload
 ·   Uploaded Files
 ·   Recent Changes
 ·   RSS
 ·   Atom
 ·   What Links Here

Active Members:

Search:

 

Create or Find Page:

 

View Knowledge Grid Statistics

This topic describes internal tweaks which may help you use Knowledge Nodes (KNs) more efficiently.

Note that this is not part of the official ICE capabilities, rather some internal flags, and as such is not part of the standard QA regression test plan. Use at your own risk. Note also that these commands will display their output not on MySQL console, but either on a server console (if a proper command line option is set) or in bh.err file – in the same place as the execution log is stored.

The first tweak is called roughstats. It is used as an alias of the first column:

SELECT COUNT(*) AS roughstats FROM t1;

The query itself works as usual, but on the server console (or *.err log file) the statistics of the first used table will be displayed in the following form (the example is for transactions table from a financial database):

-------- RSIndices for transactions (tab.no. 371, 17 packs) ------

Name Triv Packs Span Hist Dens
———————- ———————- ———————- ———————-
trans_id 0 99.99% 73.31%
account_id 0 99.94% 69.78%
tdate 0 5.88% 14.19%
(…)
bank 0 94.11% 100%
account 0
———————- ———————- ———————- ———————-

-------------------------------------------------------------------

The following information is displayed:

  • The name of the table and its internal number, used e.g. for identifying KN files,
  • The total number of packs (the same for each column, of course),
  • Each line contain information about one column:
    —Its name
    —Number of packs which are trivial, i.e. contain nulls only, or the same numerical value for each row (homogenous),
    —Average span, i.e. “width” of the numerical pack (max. value minus min. value), comparing to the overall span of values in this column,
    —Histogram/CMAP density (if one exists), i.e. percentage of 1’s in histograms.

Statistics of histograms and span are calculated for nontrivial packs only.
In general, small % values are better. It may be considered as a probability that a random value will not be excluded by KN (min/max or histogram/CMAP).

The example above (for column tdate) means that all 17 packs are nontrivial and each pack is included in an interval covering (in average) 5.8% of the whole domain of tdate and each such interval is additionally scattered such that only 14% of its values are actually present. Thus, if we have random condition “tdate = x”, where x is somewhere between minimal and maximal value of tdate, then we have 100% – (5.88% * 14.19%) = 99.1% chance that this condition will be resolved by KNs, without touching the data (for a particular pack). I.e. probably just one pack will be opened by such query.

Another method to display statistics is roughattrstats alias:

SELECT col AS roughattrstats FROM tab LIMIT 1;

It is more detailed than the previous one. It displays information about just onee selected column, but separately for each data pack:

-- examples from TPC-H 1 GB database

select l_orderkey as roughattrstats from lineitem limit 1;

Column 0, table 71

Pack Rows Nulls Min Max Sum Hist
—————— —————— ————— ——————- ——————- —————————
0 65536 0 1 65380 2138859427 1024/1024
1 65536 0 65381 130530 6419413334 1024/1024
2 65536 0 130531 196230 10703511680 1024/1024
90 65536 0 5896963 5962531 388616223147 1024/1024
91 37439 0 5962531 6000000 223932182133 1024/1024
—————— —————— ————— ——————- ——————- —————————

select p_container as roughattrstats from part limit 1;

Column 6, table 66

Pack Rows Nulls Min Max Sum Hist
—————— —————— ————— ——————- ——————- —————————
0 65536 0 0 39 1278387 38/38
1 65536 0 0 39 1278387 38/38
2 65536 0 0 39 1278387 38/38
3 3392 0 0 39 64747 38/38
—————— —————— ————— ——————- ——————- —————————

Displayed information (numerical column) involve column and table internal number, pack number, number of rows in the pack, number of null values in the pack, minimal and maximal value (internally encoded), sum of numerical values of all rows in pack, and histogram density. The density is displayed as ‘x/y’, where x is a number of ones in histogram (the value may occur in data pack), and y is a total number of histogram positions. Default value 1024 means that the interval for the data pack is wider than 1024, so histogram is approximate. Value ‘n/a’ means that the histogram is not available for some reason.

Another example is from a database storing VIN numbers of cars:

select vin as roughattrstats from fact_sales limit 1;

Column 0, table 140

Pack Rows Nulls Min Max Size CMap
———— ————— ———- ——————- ——————- ———- ————————————-
0 65536 0 0AAHL653 8ZZYU286 17 9-26-26-26-26-10-10-10…
1 65536 0 0AAFF860 8ZZUM680 17 9-26-26-26-26-10-10-10…
15255 65536 0 0AAIY380 8ZZND316 17 9-26-26-26-26-10-10-10…
15256 65536 0 0AAGO905 8ZZWK053 17 9-26-26-26-26-10-10-10…
15257 65536 0 0AAEP696 8ZZTT577 17 9-26-26-26-26-10-10-10…
15258 51712 0 0AABB747 8ZZHU603 17 9-26-26-26-26-10-10-10…
———— ————— ———- ——————- ——————- ———- ————————————-

String columns are described by minimal and maximal value (8 first characters, which may define a common prefix), maximal length detected, and CMap statistics. The latter is displayed as a series of numbers indicating a number of different characters on consecutive string positions (example above: only 9 characters on the first position, probably digits ‘0’-’8’). Up to 7 positions of CMap is displayed. If a common prefix is forund, the CMap starts on the first character after prefix. Value ‘n/a’ means that the CMap is not available for some reason.

Unlike roughstats, this switch takes into account also WHERE clause. It is possible to see how much rows were filtered out by WHERE condition, for every data pack. Note that min/max/null/KN information is still global (for the full pack, not the filtered one).

The statistics may not be displayed if no rows is returned from query, or aggregations (GROUP BY, COUNT etc.) are in use. Thus the best form of query is:

SELECT col AS roughattrstats FROM tab1 [WHERE ...] LIMIT 1;

LIMIT 1 will prevent unnecessary displaying too large result.