Joinutility seperatorLogin utility separator Infobright.com
   
 
Selects on varchar fields are slow
Posted: 08 June 2010 10:43 AM   Ignore ]  
Newbie
Rank
Total Posts:  13
Joined  2009-10-07

Hello,

this one requires rather verbose explanation to ask the questions it’s posted for, so let me move on. In project I’m working on, data is collected on daily basis, so there’s a separate table for each day containing daily data. If data changes, table gets dropped and created+loaded again. That’s obvious, just to briefly describe the procedure. So let’s say I have two weeks of data around, i.e 14 tables of following structure:

CREATE TABLE `aggregate_2010_05_24` (
  `pub_id` int(10) DEFAULT NULL,
  `url_id` int(10) DEFAULT NULL,
  `item_id_one` int(10) DEFAULT NULL,
  `item_id_two` int(10) DEFAULT NULL,
  `pub_domain` varchar(512) DEFAULT NULL,
  `imps` int(11) DEFAULT NULL,
  `clicks` int(11) DEFAULT NULL,
  `convs` int(11) DEFAULT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=utf8;

each table has approximately 300000 records and 50000 to 100000 unique pub_domain field values (the last part is important).
Now I’m trying to do this:

SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24`  GROUP BY `pub_domain` limit 20;

and it works pretty well, in 0.30 sec. However the goal is to pull data for all days available and hen get the summary. So next thing I’m doing is a UNION of several queries like the one above:

SELECT data.`pub_domain`,  SUM(data.imps) as imps, SUM(data.clicks) as clicks, SUM(data.convs) as convs FROM ( (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_25`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_26`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_27`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_28`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_29`    GROUP BY `pub_domain`) ) AS data   GROUP BY data.`pub_domain` ORDER BY imps DESC, clicks DESC, convs DESC LIMIT 30;

and it works OK, taking 3.46 sec to proceed. Then I’m adding few more days, query gets a bit slower (about 0.8 sec slower for each day, when suddenly after another addition it just hangs! So this one:

SELECT data.`pub_domain`,  SUM(data.imps) as imps, SUM(data.clicks) as clicks, SUM(data.convs) as convs FROM ( (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_25`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_26`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_27`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_28`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_29`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_30`    GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_31`    GROUP BY `pub_domain`)  UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_06_01` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_06_02`    GROUP BY `pub_domain`)) AS data   GROUP BY data.`pub_domain` ORDER BY imps DESC, clicks DESC, convs DESC LIMIT 30;

hangs forever. Well, probably not forever and certainly not just hangs - at some point:
- system starts to show heavy disk writes (not swapping though),
- ‘show processlist’ says ‘Copying to tmp table’,
- ‘top’ shows load average spike
- free disk space on partition runs out at approximate speed of 40MB/second

And, well, it goes and then goes on - I killed the query after few minutes.
The strange thing here is it started after a single step of adding new subquery to UNION, i.e. was working fine, then next query placed and bang! - it goes down. That led me to a couple of questions I’d like to finally ask here:

- is there a sort of memory threshold that, when overrun, causes this behavior?
- if yes, can it be enhanced at some configuration level?
- what does IB write to disk at that speed and amounts?
- what does generally “Copying to tmp table” message mean? Is it table in memory, and of what type (supported by what engine) is that?
- I guess this is connected with a fact that field is of varchar type. Is it really? And if it is, are there ways to fix/overcome that? I’ve tried similar approach on smaller varchar fields (like of 30 symbols) and haven’t got any reasonable improvement.

Your answers would really help here because this becomes a real problem in our data processing.

Thank you.

Profile
 
Posted: 09 June 2010 12:28 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi,

UNION ALL is not the most efficient way to process data, and - if possible - it is better to just keep all the data in one table. Or to keep just the last day in a separate table (if updating is needed), then moving its contents to the main table every night. Note that if the table has a date filed, and if it is loaded chronologically, then adding a limiting condition on dates is very efficient. Queries should work well even for a billion of rows (10 years of data). Another note about the scheme: if UTF8 is not really important in domain names, it is better to use latin1 or other 1-byte charset.

There is a threshold, indeed, for which data is cached on disk before the next UNION operations are performed. This is an internal Infobright caching mechanism, not a system swap. The threshold is based partially on a ServerMainHeapSize, so increasing this value in brighthouse.ini configuration file may change the situation. Additionally, the new release of Infobright (3.4, available soon) should perform better in UNION ALL queries.

‘Show processlist’ is not a good way to profile Infobright queries. Use our Query Execution Log instead:
http://www.infobright.org/wiki/Query_Execution_Log/
Although in your particular situation it will not help too much

Regards,

[ Edited: 09 June 2010 12:35 AM by Jakub Wroblewski]
Signature 
Profile
 
Posted: 09 June 2010 05:34 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  13
Joined  2009-10-07

Hi,

thanks for the fast reply! Few additional comments and questions below:

Jakub Wroblewski - 09 June 2010 12:28 AM

Hi,

UNION ALL is not the most efficient way to process data, and - if possible - it is better to just keep all the data in one table. Or to keep just the last day in a separate table (if updating is needed), then moving its contents to the main table every night. Note that if the table has a date filed, and if it is loaded chronologically, then adding a limiting condition on dates is very efficient. Queries should work well even for a billion of rows (10 years of data). Another note about the scheme: if UTF8 is not really important in domain names, it is better to use latin1 or other 1-byte charset.

Unfortunately there’s no way (in my situation) to store data in a single table because data for any day can eventually be updated. So I need separate tables to make sure I can re-load it for any day. On UTF8 - in this case it’s true, however in project I’m working on, there are similar situations with fields that definitely have to support UTF8…

Jakub Wroblewski - 09 June 2010 12:28 AM

There is a threshold, indeed, for which data is cached on disk before the next UNION operations are performed. This is an internal Infobright caching mechanism, not a system swap. The threshold is based partially on a ServerMainHeapSize, so increasing this value in brighthouse.ini configuration file may change the situation. Additionally, the new release of Infobright (3.4, available soon) should perform better in UNION ALL queries.

OK, thanks for the tip, I’ll try increasing that parameter’s value. BTW, is there more or less settled date on 3.4 release? Can we expect it in like a week, a month or a few months? I’m asking because I should make some decision on whether I can just wait and see if that one helps resolving my issue or deal with it in some different way (like storing IDs istead of strings and keeping string values dictionary separte, mapping query results to it.

Jakub Wroblewski - 09 June 2010 12:28 AM

‘Show processlist’ is not a good way to profile Infobright queries. Use our Query Execution Log instead:
http://www.infobright.org/wiki/Query_Execution_Log/
Although in your particular situation it will not help too much

Regards,

Thanks again.

Profile
 
Posted: 09 June 2010 07:09 AM   Ignore ]   [ # 3 ]  
Newbie
Rank
Total Posts:  13
Joined  2009-10-07

Also, are there any other configurable ways to adjust caching mechanism thresholds?

Profile