Joinutility seperatorLogin utility separator Infobright.com
   
 
Long query times for SUM() where id IN (1,2,3,4,5,6,67,890,....) group by
Posted: 30 June 2009 06:11 PM   Ignore ]  
Jr. Member
RankRank
Total Posts:  78
Joined  2009-04-28

We’ve got an instance of ICE running w/one table, 30 col, and approx 50 mil rows (approx 250M of infobright disk space).

I think my SUM() WHERE IN () queries are flipping into mysql optimizer…. They’re just taking too long. Any ideas? help on rewriting these so they’ll be more Infobright/ICE friendly?

thanks,
erin

DETAILS:
1)
SELECT SUM(a2) a2,a1_id a1_id,s_date
s_date FROM stats WHERE p_id IN
(75,189,250,345,359,447,471,516,534,575,576,582,607,629,636,661,748,749,991,995,996,998,999,1000,1001,1002,1003,1004,1005,1006,1026,1032,1065,1093,1094,1106,1107,1108,1109,1110,1111,74)
AND s_date BETWEEN 14390 AND 14419 GROUP BY s_date,a1_id;

And it takes mysql 17.66 sec vs IB’s 10.50 sec

2) With only ONE id in the IN:
SELECT SUM(a2) a2,a1_id a1_id,s_date
s_date FROM stats WHERE p_id in (189)  AND s_date BETWEEN 14390
AND 14419 GROUP BY s_date,a1_id;

mysql: .47 sec
IB: 1.5 sec

3) With TWO ids in the IN:
SELECT SUM(a2) a2,a1_id a1_id,s_date
s_date FROM stats WHERE p_id IN (75,189)  AND s_date BETWEEN
14390 AND 14419 GROUP BY s_date,a1_id;
  mysql   0.81 sec
  ib 8 sec

4) This query with two logical OR IDs in the IN
SELECT SUM(a2) a2,a1_id a1_id,s_date
s_date FROM stats WHERE (p_id = 75 OR p_id = 189)  AND
s_date BETWEEN 14390 AND 14419 GROUP BY s_date,a1_id;

mysql: .79 sec
IB: 3.00 sec

5) And if I have a NUMBER of IDS in the logical OR:
SELECT SUM(a2) a2,a1_id a1_id,s_date
s_date FROM stats WHERE (p_id = 75 or p_id = 189 or
p_id = 250 or p_id = 345 or p_id = 359 or p_id =
447 or p_id = 471 or p_id = 516 or p_id = 534 or
p_id = 575 or p_id = 576 or p_id = 582 or p_id =
607 or p_id = 629 or p_id = 636 or p_id = 661 or
p_id = 748 or p_id = 749 or p_id = 991 or p_id =
995 or p_id = 996 or p_id = 998 or p_id = 999 or
p_id = 1000 or p_id = 1001 or p_id = 1002 or
p_id = 1003 or p_id = 1004 or p_id = 1005 or
p_id = 1006 or p_id = 1026 or p_id = 1032 or
p_id = 1065 or p_id = 1093 or p_id = 1094 or
p_id = 1106 or p_id = 1107 or p_id = 1108 or
p_id = 1109 or p_id = 1110 or p_id = 1111 or
p_id = 74) AND s_date BETWEEN 14390 AND 14419 GROUP BY
s_date,a1_id;

  mysql: 1 min 4.02 sec
  IB: 36.5 sec

Profile
 
Posted: 01 July 2009 12:31 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello,

I don’t think we flip into mysql optimizer in this case.

(But the logs would clarify it.)

I’m afraid that it’s just how ICE works currently.

Before I write more, let me ask about the following:

eonarts - 30 June 2009 06:11 PM

And it takes mysql 17.66 sec vs IB’s 10.50 sec

Do you mean standard mysql? In other words, do you have two copies of the same data—one stored in mysql and another in ICE?

Best greetings,

Dominik

Signature 
Profile
 
Posted: 01 July 2009 01:05 PM   Ignore ]   [ # 2 ]  
Jr. Member
RankRank
Total Posts:  78
Joined  2009-04-28

I have a benchmark server set up so I can for now while the dataset is not too large test the times for Infobright & then shut down infobright & bring up mysql & test the times there.

Which log file ?? I’ve got bh.err & the brighthouse.log file. The slow-query.log file is empty (hmmm I better investigate that one!).

thanks,
erin

Profile
 
Posted: 01 July 2009 01:30 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello,

If it switches to mysql, you’ll find it in bh.err.

(But I’d be very surprised if it does. It should be executed without switching. I’ll be looking forward to further news from you.)

eonarts - 01 July 2009 01:05 PM

I have a benchmark server set up so I can for now while the dataset is not too large test the times for Infobright & then shut down infobright & bring up mysql & test the times there.

Are you running all the queries in infobright and then shut down?

Or are you running the first query in infobright, shut down, run it in mysql, shut down mysql, run the second query in infobright, et cetera?

I’m asking about it because infobright keeps a certain amount of last-used data packs in memory. So, if you run a sequence of queries in infobright without a restart, the performance may be different than when restarting after every single query.

For this type of comparative analysis, it would be better to restart after every single query.

Thanks a lot and best greetings,

Dominik

Signature 
Profile
 
Posted: 01 July 2009 01:41 PM   Ignore ]   [ # 4 ]  
Jr. Member
RankRank
Total Posts:  78
Joined  2009-04-28

Ideally I’d have two servers of the same hardware but this is all I’ve got & it works rather nicely. So yes, run queries in infobright, shut down infobright, then start up mysql & run them in mysql. I use myisam as Innodb needs a warm up & takes up more disk space.

I’ll restart IB everytime I run the query then!  Right now I’ve got some mysql testing to do & I’ll see which one they want to have priority.

thanks for the quick reply!

erin

Profile
 
Posted: 01 July 2009 01:49 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello,

I realize that I’ve just generated more work for you, related to repeating the experiments.grin

But it would be indeed good to see what happens if we restart after every query.

(And also to see bh.err)

Please take your time. Whenever you have new results, we’ll be happy to look at them.

Best greetings and thanks,

Dominik

Signature 
Profile
 
Posted: 01 July 2009 01:52 PM   Ignore ]   [ # 6 ]  
Jr. Member
RankRank
Total Posts:  78
Joined  2009-04-28

hehe Lucky I love doing this sort of thing!

erin

Profile
 
Posted: 01 July 2009 02:21 PM   Ignore ]   [ # 7 ]  
Member
Avatar
RankRankRank
Total Posts:  160
Joined  2009-04-01

Hi Erin,

Just a couple of more points to consider as you continue benchmarking.

The results your are seeing are probably reasonable when you consider the following;

1. ICE stores your data into data packs for every 65,536 values.

2. You have a table with 50M rows which breaks down to 763 data packs.

3. Looking at the query with IN (189)—> 1.5 secs, depending on your data distribution this could resolve to only a few data packs that would need to be decompressed and scanned to resolve your query.

4. When you increase the scope of your query with IN (75,189,250,345,359….....,1111,74)—> 10.5 secs, then you increase the probability that many more data packs need to be decompressed and scanned to resolve your query. This will directly relate to longer execution times.

If you want to check to see if the SQL is switching to MySQL then you can check using the “SHOW WARNINGS” statement. Also as Dominik mentioned previously the bh.err file will indicate if the MySQL Engine is used when you see a message like

Table ./new_carsales/dim_msa (2756accessed by MySQL engine

You may also want to check some of our SQL optimization suggestions. The Infobright engine works best if you avoid using OR conditions as mentioned in the wiki pages here.

The slow query log is controlled by the MySQL framework and can be controlled using standard MySQL functionality as documented here.

If you are still having trouble with the MySQL logs, then you may want to check out the following forum post;

Enable MySQL Logging

Please post back to keep us up to date on your progress.

Regards, Bob Newell

[ Edited: 01 July 2009 08:01 PM by Bob Newell]
Profile
 
Posted: 01 July 2009 06:15 PM   Ignore ]   [ # 8 ]  
Jr. Member
RankRank
Total Posts:  78
Joined  2009-04-28

I wiped all my data out of both the MySQL table and the Infobright (IB) table. I first worked with the MySQL instance & then switched to IB.

ENGINE TABLE NUM ROWS   COMMENTS
mysql   stats   56,173,615   query_cache_size =0, query_cache_type=OFF
IB         stats   56,173,615   restart IB between each query

1) SELECT w/SUM(), IN (42 ids) GROUP BY (see above for more details on query)
  mysql   2610 rows in set (56.65 sec) 
  IB         2610 rows in set (9.96 sec) 
RESTART infobright  
2) SELECT w/SUM, IN (1 id) GROUP BY (see above for more details on query)
  mysql   268 rows in set (1.46 sec)
  IB         268 rows in set (5.05 sec)
RESTART infobright  
3)  SELECT w/SUM, IN (2 id’s) GROUP BY (see above for more details on query)
mysql   335 rows in set (1.12 sec)
IB         335 rows in set (8.57 sec)
RESTART infobright
4) SELECT w/SUM, IN (id OR id)    GROUP BY (see above for more details on query)
mysql   335 rows in set (0.85 sec)
IB         335 rows in set (5.82 sec)
RESTART infobright
5) SELECT w/SUM, IN (id OR ... 42 ids in all) GROUP BY (see above for more details on query) 
mysql   2610 rows in set (27.85 sec) 
IB         2610 rows in set (29.90 sec) 
Then w/o a RESTART of infobright I rant the following queries:
  #4 query   335 rows in set (2.33 sec)
  #5 query   2610 rows in set (26.16 sec)
  #1 query   2610 rows in set (6.06 sec)
  #2 query   268 rows in set (1.19 sec)

So I can see that IB is warmed up & that affects the query speed.

Then I did this on the bh.err file:
  grep Table bh.err |grep -v exist
(I don’t have a mysql.slow_log table or a mysql.general_log table yet. OH? & I guess that’s the next thing to look into setting up!)
And nothing came through. 

Interesting benchmarks ....

erin

Profile
 
Posted: 02 July 2009 12:01 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello,

Erin, thanks a lot for more tests!

Bob, thanks a lot for providing a sounder background to the discussion!

eonarts - 01 July 2009 06:15 PM

Then w/o a RESTART of infobright I rant the following queries:
  #4 query   335 rows in set (2.33 sec)
  #5 query   2610 rows in set (26.16 sec)
  #1 query   2610 rows in set (6.06 sec)
  #2 query   268 rows in set (1.19 sec)

So I can see that IB is warmed up & that affects the query speed.

Yes, it’s warmed up. The queries #1-#5 generally refer to the same “areas” of data. Hence, if the data is not so large and if the knowledge grid is selective enough, then the data packs necessary while executing the previous queries will remain in memory (if there is enough memory) and help in executing the next queries. Of course it does not need to be like this. Nevertheless, it’s a very helpful feature.

eonarts - 01 July 2009 06:15 PM

1) SELECT w/SUM(), IN (42 ids) GROUP BY (see above for more details on query)
  mysql   2610 rows in set (56.65 sec) 
  IB         2610 rows in set (9.96 sec) 
RESTART infobright  
2) SELECT w/SUM, IN (1 id) GROUP BY (see above for more details on query)
  mysql   268 rows in set (1.46 sec)
  IB         268 rows in set (5.05 sec)
RESTART infobright  
3)  SELECT w/SUM, IN (2 id’s) GROUP BY (see above for more details on query)
mysql   335 rows in set (1.12 sec)
IB         335 rows in set (8.57 sec)
RESTART infobright
4) SELECT w/SUM, IN (id OR id)    GROUP BY (see above for more details on query)
mysql   335 rows in set (0.85 sec)
IB         335 rows in set (5.82 sec)
RESTART infobright
5) SELECT w/SUM, IN (id OR ... 42 ids in all) GROUP BY (see above for more details on query) 
mysql   2610 rows in set (27.85 sec) 
IB         2610 rows in set (29.90 sec)

It shows the differences in the mysql/infobright performance dynamics.

It also shows that we will still need to work on “IN” in the future.

Erin, I hope that, after all, you find us efficient enough for the queries you’re interested in. Your queries are certainly a good “IN” case study and we’ll try to use them as a future reference point.

In particular, I believe that #3 should be faster than #4. I’ll double check it…

Best greetings,

Dominik

Signature 
Profile