Hello,
this one requires rather verbose explanation to ask the questions it’s posted for, so let me move on. In project I’m working on, data is collected on daily basis, so there’s a separate table for each day containing daily data. If data changes, table gets dropped and created+loaded again. That’s obvious, just to briefly describe the procedure. So let’s say I have two weeks of data around, i.e 14 tables of following structure:
CREATE TABLE `aggregate_2010_05_24` (
`pub_id` int(10) DEFAULT NULL,
`url_id` int(10) DEFAULT NULL,
`item_id_one` int(10) DEFAULT NULL,
`item_id_two` int(10) DEFAULT NULL,
`pub_domain` varchar(512) DEFAULT NULL,
`imps` int(11) DEFAULT NULL,
`clicks` int(11) DEFAULT NULL,
`convs` int(11) DEFAULT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=utf8;
each table has approximately 300000 records and 50000 to 100000 unique pub_domain field values (the last part is important).
Now I’m trying to do this:
SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24` GROUP BY `pub_domain` limit 20;
and it works pretty well, in 0.30 sec. However the goal is to pull data for all days available and hen get the summary. So next thing I’m doing is a UNION of several queries like the one above:
SELECT data.`pub_domain`, SUM(data.imps) as imps, SUM(data.clicks) as clicks, SUM(data.convs) as convs FROM ( (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_25` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_26` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_27` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_28` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_29` GROUP BY `pub_domain`) ) AS data GROUP BY data.`pub_domain` ORDER BY imps DESC, clicks DESC, convs DESC LIMIT 30;
and it works OK, taking 3.46 sec to proceed. Then I’m adding few more days, query gets a bit slower (about 0.8 sec slower for each day, when suddenly after another addition it just hangs! So this one:
SELECT data.`pub_domain`, SUM(data.imps) as imps, SUM(data.clicks) as clicks, SUM(data.convs) as convs FROM ( (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_24` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_25` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_26` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_27` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_28` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_29` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_30` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_05_31` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_06_01` GROUP BY `pub_domain`) UNION ALL (SELECT `pub_domain`, SUM(imps) as imps, SUM(clicks) as clicks, SUM(convs) as convs FROM `aggregate_2010_06_02` GROUP BY `pub_domain`)) AS data GROUP BY data.`pub_domain` ORDER BY imps DESC, clicks DESC, convs DESC LIMIT 30;
hangs forever. Well, probably not forever and certainly not just hangs - at some point:
- system starts to show heavy disk writes (not swapping though),
- ‘show processlist’ says ‘Copying to tmp table’,
- ‘top’ shows load average spike
- free disk space on partition runs out at approximate speed of 40MB/second
And, well, it goes and then goes on - I killed the query after few minutes.
The strange thing here is it started after a single step of adding new subquery to UNION, i.e. was working fine, then next query placed and bang! - it goes down. That led me to a couple of questions I’d like to finally ask here:
- is there a sort of memory threshold that, when overrun, causes this behavior?
- if yes, can it be enhanced at some configuration level?
- what does IB write to disk at that speed and amounts?
- what does generally “Copying to tmp table” message mean? Is it table in memory, and of what type (supported by what engine) is that?
- I guess this is connected with a fact that field is of varchar type. Is it really? And if it is, are there ways to fix/overcome that? I’ve tried similar approach on smaller varchar fields (like of 30 symbols) and haven’t got any reasonable improvement.
Your answers would really help here because this becomes a real problem in our data processing.
Thank you.
