Joinutility seperatorLogin utility separator Infobright.com

David Lutz's Blog

23
Feb

String CHARs Incident

David Lutz's photo
by David Lutz     Mon, Feb 23, 2009

This week's post is not strictly related to data warehousing but does come up frequently enough in usage of Infobright for analytical, Business Intelligence and data warehousing databases that I thought I would share it here.  Actually, our Sales Engineer-at-Large, Susan Bantin, made the suggestion:  "David, Your first two paragraphs would make a good answer to a community question about using long strings."

For additional, 3rd-party discussion on this topic, please see Kevin Galligan's blog here (and an update here).

The material below was in response to the following question from an Infobright consulting partner (paraphrased):

"This customer’s data has many 80+ character descriptions that would need to be returned and/or used in WHERE clauses.”

Column-store databases, in general, offer improved performance over row-based databases when the usage pattern is more column- than row-oriented in nature (obviously).  For example, counting the number of occurrences of a value in one column out of possibly dozens of columns in a table reduces the I/O compared to that which a row-oriented database would have to perform because only the column being evaluated would need to be read from disk in a columnar database.  Infobright offers benefits in addition to that when either a) an aggregate value can be provided directly from our metadata layer, the Knowledge Grid, or b) the Knowledge Grid can “guide” the Executor to only the Data Packs (individual buckets of 65,536 values in that column) for reading, decompression and evaluation.  Generally speaking, Infobright is optimized for resolving analytical queries “about” the data over retrieving detailed, specific pieces of content in the data.

However, any database is going to have to evaluate virtually every row element for string searches – especially where wildcards are involved – and compare them byte-by-byte to the filter pattern.  In these cases, it appears there is one search to find a string value – this could be very fast depending on the type/nature of the filter – and then another search where the strings in the database match a string filter condition.  This will almost always be slower than numerical operations.  It is not an analytical query in nature and does not benefit from the Knowledge Grid.  Even though Infobright maintains some metadata on character data types (character bitmaps, or CMAP Knowledge Nodes), the bitmap is stored for only the first 64 bytes of a string and can only be leveraged when there are no leading or embedded wildcard characters (trailing wildcards are not a problem).  Supply a 65+ byte string as a filter and you’re back to decompressing every Data Pack and performing a byte-by-byte comparison of that column value for every row.