Joinutility seperatorLogin utility separator Infobright.com
   
 
Character fields > 64 in length
Posted: 21 December 2009 01:03 AM   Ignore ]  
Newbie
Rank
Total Posts:  33
Joined  2009-05-13

Hi,  I installed 3.3 a few days ago and have verified that the “trickle feed” problem does indeed seem to be corrected. Thanks.

Having that corrected has now revealed another performance issue that I don’t quite understand.

Assume that I have the following test table with 4 fields.
id       BIGINT
clientip VARCHAR(15)
subject VARCHAR(100)
sender VARCHAR(100)


If I issue the command
SELECT subject from test_table WHERE clientip = ‘123%’ the system reports no matching rows after about 15 seconds.  A second try takes about 1 second to confirm no rows exist.  I assume the faster second query is because the knowledge grid or cmap was loaded into ram the second time. This is how i would expect it to work.

However, when I execute the same query using any of the longer varchar fields in the where clause it takes about 1.5 minutes to report no rows.  Subsequent runs take just as long.
SELECT subject from test_table WHERE subject = ‘123%’

I was under the impression that the longer VARCHAR fields had no performance impact as long as IB is able to use the first 64 bytes of data to match the field or not.  In my example none of the records started with 123 in either case but performance is continually bad only on the second instance.

Can’t help but think there is a lingering performance issue related to long character fields.  Overall it’s definitely better than it was in 3.2 though.

Thanks again.

Mike

[ Edited: 21 December 2009 11:29 AM by Mike M]
Profile
 
Posted: 21 December 2009 04:27 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  648
Joined  2008-08-18

Hi !

Basically, you are right in saying that “there is a lingering performance issue related to long character fields”. It is rather trivial - working with long data takes more time than working with short data.
Knowledge grid is not able to pinpoint the relevant rows. If your data is sorted on ‘subject’ field, then KG would be more efficient. Even if part of data is eliminated using KG, the rest must be fetched, decompressed and examined item by item. Decompression of long strings is slow - this is the main reason why it takes more time for VARCHAR(100) then for VARCHAR(15). Then, decompressed data can be cached in memory to speed up subsequent queries. Short strings can fit in cache, long ones apparently cannot, so they must be decompressed for each query. You may try adding memory and set MainHeapSize to a larger value to have larger cache and therefore speed up the subsequent query execution.

Anyway, if you can post bh.err for both queries, we would be able to tell more…

Profile
 
Posted: 21 December 2009 11:27 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  33
Joined  2009-05-13

Thanks for the feedback. I’ll get you examples of the logs as soon as I have a moment to work on it again.

In the meantime, it was my understanding from previous discussions with david and dominik that the length of a char field makes little difference in overall searching performance as long as the data you are searching for in that field is contained within the first 64 characters of the field (stored in KG).  That is the part that doesn’t seem to hold true based on my recent testing.  I thought the only time the full length of the field should come into play is when actually retrieving the field for display or when needing info beyond the first 64 bytes in the where clause - such as a in a wildcard that begins with “%”.

My examples above should have been able to eliminate all of the rows without having to decompress since there are no records in the output and the first 3-characters of the field (my search string) would have been all it takes to eliminate the record as a match. I could understand how the first time it may have to decompress the data to load the KG but I would think thereafter it would not need that step.

In my example the clientip field is no more ordered than the varchar100 fields yet the performance difference is huge. That’s why it got my attention.

I’ll experiment some more and get the log file examples to you.

Mike

Profile
 
Posted: 21 December 2009 12:09 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  578
Joined  2008-08-18

Hi,

Mike M - 21 December 2009 11:27 AM

My examples above should have been able to eliminate all of the rows without having to decompress since there are no records in the output and the first 3-characters of the field (my search string) would have been all it takes to eliminate the record as a match.

What kind of data were present in the column? Any of the rows involved numbers in the first three positions?

Regards,

Signature 
Profile
 
Posted: 21 December 2009 12:18 PM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  33
Joined  2009-05-13

yes, the clientip field start with digits while the varchar fields generally start with alpha.

interesting idea.  Maybe IB is handling the field differently just because the data begins with a number even though it is still an alpha-numeric field.  Very interesting if so.  I can’t imagine why that would matter but you’re the expert grin

Mike

Profile
 
Posted: 21 December 2009 12:40 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  578
Joined  2008-08-18

Hi,

This is rather about the potential of CMAP to help with your query. VARCHAR are handled the same way regardless of they contents (numbers etc).

CMAPs stores information about existence of a given character (e.g. ‘a’) on a given position in a string, over all the rows in data pack. I.e. if any row in our data pack starts with ‘a’, then information “‘a’ is possible on position 1” is stored. Similarly on the second position etc. Which means, that if we have the following data:

111
233
727

...then we cannot exclude ‘123%’ on rough level, because ‘1’ is possible on position 1, ‘2’ is possible on position 2, ‘3’ is possible on position 3 (and we are not storing any information about occurences of such sequence together in the same row). This is how CMAPs work.

Regards,

Signature 
Profile
 
Posted: 22 December 2009 12:40 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  33
Joined  2009-05-13

Wow, that’s the best explanation of CMAP that I have ever seem. Understand exactly and also see why my results are so vastly different with the numeric data in one field.  Now that I know how it works maybe I can get creative on some other performance optimization techniques of my own on my end.

Thanks

Profile