One area in analytics of any kind that is particularly challenging for users, designers, adminsitrators, and even vendors of analytical technologies (such as databases) is that of searching within string fields - especially long strings. Working with our partners and customers, Infobright has developed a proposal for assisting with this challenge.
First, let's understand the challenge a little better - "Infobright is bad at text fields". This is both somewhat true in some cases and not at all true in others, as is typical with many things software-related. I covered this topic in a previous blog, String CHARs Incident (Part I), http://www.infobright.org/Blog/Archives/david_lutz_blog/2009/02, but let me elaborate a little bit more here.
Text processing, such as filtering and matching, is more computationally intensive for any technology than numeric computation for the simple reason that processing must be done on a byte-by-byte basis as opposed to the internal ability to do bit-wise operations on all manner of numeric data. But this is true for Infobright only when reading the actual data.
Infobright stores metadata about both textual and numeric data fields (in numeric and bitmapped Knowledge Nodes) and these can be used to surface results to a query or assist in efficient filtering. But that is mitigated in certain circumstances. Specifically, these circumstances include searching textual fields that are longer than 64 characters in length or use wildcard values in the beginning or middle of a search string. These eliminate the possibility to leverage the metadata as the bitmaps store positional metadata for only the first 64 characters of the string. Also, if there is a high degree of cardinality in the textual data (many distinct values), it is possible that one would achieve sub-standard compression ratios.
That addresses why text searching is a challenge. Now let's look at a way to improve this performance. It was worded (paraphrased) as follows by a partner prospect:
"We would like to replace queries
SELECT ... FROM table WHERE str=‘value'
by
SELECT ... FROM table WHERE str=‘value' AND cksum=cksum(str)
It is hoped that this method of adding an INTEGER search field will exclude more Data Packs from the search because the Knowledge Grid histograms for numeric values might be more efficient and faster than the Knowledge Grid CMAP (Character Bitmaps)."
MySQL, and by extension Infobright, support two (2) checksum types of functions - MD5() and SHA[1].
From MySQL Administrator's Guide:
MD5(str)
Calculates an MD5 128-bit checksum for the string. The value is returned as a binary string of 32 hex digits, or NULLif the argument was NULL. The return value can, for example, be used as a hash key.
mysql> SELECT MD5(‘testing');
-> ‘ae2b1fca515949e5d54fb22b8ed95575'
This is the "RSA Data Security, Inc. MD5 Message-Digest Algorithm."
SHA1(str), SHA(str)
Calculates an SHA-1 160-bit checksum for the string, as described in RFC 3174 (Secure Hash Algorithm). The value is returned as a binary string of 40 hex digits, or NULLif the argument was NULL. One of the possible uses for this function is as a hash key. You can also use it as a cryptographic function for storing passwords. SHA()is synonymous with SHA1().
mysql> SELECT SHA1(‘abc');
-> ‘a9993e364706816aba3e25717850c26c9cd0d89d'
Both functions return a HEX value which appears as an alphanumeric string when displayed. But since it is actually a base-16 number value, it can be converted to base-10 and stored in an INTEGER field, thereby benefitting from the automatic creation of numeric histogram metadata on LOAD. The only challenge, of course, is to fit the result of the converted checksum value into the largest possible numeric data type, BIGINT. And we have addressed that in this proposal by "trimming" the output of the checksum function with the SUBSTR() function.
To make this work at query time, one must include both the exact string to be matched along with the numerical equivalent of the functional output as described below.
The results are impressive. In the example below, queries were reduced from over 3 minutes when searching on the string text alone to less than 10 seconds when adding the numeric value. And the results are consistent after restarting the server, demonstrating that the performance gains were not the result of caching.
For example:
mysql> CREATE TABLE test_table(
> str_long VARCHAR(20),
> str_numeric BIGINT
> ) ENGINE=BRIGHTHOUSE
> ;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO test_table SELECT vin, CONV(SUBSTR(MD5(vin),1,14),16,10)
> FROM carsales.fact_sales
> LIMIT 65536000
> ;
Records: 65536000 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM test_table
> WHERE
> str_long='ZZF380975Y2999951'
;
+-------------------+-------------------+
| str_long | str_numeric |
+-------------------+-------------------+
| ZZF380975Y2999951 | 38776982624885988 |
+-------------------+-------------------+
1 row in set (3 min 1.58 sec)
mysql> SELECT * FROM test_table
> WHERE
> str_numeric=CONV(SUBSTR(MD5('ZZF380975Y2999951'),1,14),16,10) AND
> str_long='ZZF380975Y2999951'
;
+-------------------+-------------------+
| str_long | str_numeric |
+-------------------+-------------------+
| ZZF380975Y2999951 | 38776982624885988 |
+-------------------+-------------------+
1 row in set (9.35 sec)
Comments (28)