Joinutility seperatorLogin utility separator Infobright.com
   
1 of 2
1
Spread work over cores
Posted: 26 October 2010 02:32 PM   Ignore ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04

Hi,

I got one table with 16000 rows and one with 40 000 000. ICE 3.4.2.

In order to utilize our 16 cores (32gb mem) I’m thinking of if there are any gains in splitting the job over cores.

select be.id, be.segment, be.timeindex, td.somecol
from (select id, segment, timeindex from events limit 7501,2500) be
, data td
where be.id = td.id
and td.timeindex between be.timeindex-4000 and be.timeindex
and td.somecol > 0
group by be.segment
order by id, td.timeindex;

Since the events table contains 16000 post I can split these into chunks of 2500 rows.
The timeindex in this case means 4 sec and each join will return 40 rows (in total 16000*40).

Running the query without limit clause took 739 sec. When starting 7 simultaneous processes with 2500 rows from events I reduced the total time to 467 sec (some finishing in 167 sec and the last to finish in 467 sec). Although the execution time was only 63% of the original query I wonder if someone has any ideas on how to optimize this?

And also to hear your best trick to fully use multicore environments?

//erik

Profile
 
Posted: 26 October 2010 03:01 PM   Ignore ]   [ # 1 ]  
Administrator
RankRankRankRank
Total Posts:  448
Joined  2010-09-22

Hi Erik,

Infobright Enterprise Edition does support multi-core, but ICE does not. 

You may also want to try to remove the join between these two tables.  One one extreme, you can try to condense the dataset to potentially one table (the compression should help alleviate space needs). 

Cheers,

Jeff

Signature 

jeff kibler

Profile
 
Posted: 26 October 2010 03:29 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  25
Joined  2009-08-06

Hi,

Another good place to check is the Tips & Tricks section of the wiki

http://www.infobright.org/wiki/Tips_&_Tricks/

Shawn

Profile
 
Posted: 26 October 2010 05:44 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

First of all, I would avoid using a subquery. Joining with a subquery result means unnecessary materialization of data, and disabling most of Knowledge Grid optimizations. If the ‘events’ table consists of any kind of key column, please use it in WHERE part instead of LIMIT. Suppose that be.timeindex is such a column:

select be.idbe.segmentbe.timeindextd.somecol
from events be
data td
where be
.id td.id
and td.timeindex between be.timeindex-4000 and be.timeindex
and td.somecol 0
AND be.timeindex BETWEEN [some values defining a 2500-row fragment]
AND td.timeindex BETWEEN [similar conditionimplication of the above]
group by be
.segment
order by id
td.timeindex

Query Execution Log is also a good tool for identifying bottlenecks.

Regards,

Signature 
Profile
 
Posted: 28 October 2010 04:53 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04

So I rewrote the query, still determined to spread the query over cores smile

select be.id, be.segment, be.timeindex, td.somecol
from events be
, data td
where be.id < 1000
and be.id = td.id
and td.timeindex between be.timeindex-4000 and be.timeindex
and td.somecol > 0
group by be.segment
order by id, td.timeindex;

And then replacing highlighted into two other querysessions with
* between 1000 and 2000
* > 2000

The first query as well as the third was done in 290 sec. Using between took 591 (looking in bh.err it seems like a traditional full table scan on the data table)!


But by replacing “be.id between 1000 and 2000” with
“be.id >=1000 and be.id <= 2000”
the query was as fast as the other ones running in 291 sec.

I thought that “between” was the problem and replaced also the between on timeindex but that made no difference, still 291 sec.

But to summarize the query of 739 sec now elapses in 291 sec using three cores

//erik

Profile
 
Posted: 28 October 2010 05:12 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Erik,

erik.svanberg - 28 October 2010 04:53 AM

But to summarize the query of 739 sec now elapses in 291 sec using three cores

Wonderful!

erik.svanberg - 28 October 2010 04:53 AM

But by replacing “be.id between 1000 and 2000” with
“be.id >=1000 and be.id <= 2000”
the query was as fast as the other ones running in 291 sec.

Very interesting. May I ask a question?

(It’s just for curiosity. I’m not even sure whether the results are going to be valid.)

Would you please try with the following:

“be.id > 1000 and be.id < 2000”

“be.id >= 1000 and be.id < 2000”

“be.id > 1000 and be.id =< 2000”

Many thanks and best greetings!

Dominik

Signature 
Profile
 
Posted: 28 October 2010 05:40 AM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi Erik,

Please try this:
http://www.infobright.org/wiki/Query_Execution_Log/

BETWEEN and the pair of inequalities should be executed in exactly same way, and I would expect that caching mechanism is responsible for any differences in time.

Regards,

Signature 
Profile
 
Posted: 28 October 2010 12:01 PM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04
Dominik Slezak - 28 October 2010 05:12 AM

Would you please try with the following:

“be.id > 1000 and be.id < 2000”  -> 341 sec

“be.id >= 1000 and be.id < 2000” -> 345 sec

“be.id > 1000 and be.id =< 2000” -> 291 sec

 

//erik

Profile
 
Posted: 28 October 2010 12:04 PM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04

Jakub: I ‘m interested in what you mean about caching. My understanding is that there are no caching in the architecture? If I run one query a number of times it will always take the same number of seconds?

//erik

Profile
 
Posted: 28 October 2010 10:38 PM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Erik,

Jakub means that the most lately queried data packs remain in memory until you restart the server or until memory needs to be freed in order to put there other data packs.

Therefore, an important question is whether you restarted the server before re-running your queries with different “between” and “inequality and inequality” options. If you didn’t, then you might see that the same (or similar) query suddenly runs faster. But if you did, then the difference in times between “between” and “inequality and inequality” options would be truly mysterious.

By the way, thanks a lot for answering my question about different inequalities. All these cases are significantly faster than the result that you originally reported for “between”. Thus, the first guess is that some data packs are already in memory. The best way to proceed is therefore to check all “between” and “inequality and inequality” options one more time, but making sure that the server is restarted after each single query.

Best greetings,

Dominik

Signature 
Profile
 
Posted: 29 October 2010 03:27 AM   Ignore ]   [ # 10 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04

restart
“be.id >= 1000 and be.id <= 2000”  -> 293 sec
restart
“between be.id 1000 and 2000”  -> 476 sec
restart
“be.id > 1000 and be.id < 2000”  -> 345 sec
restart
“be.id >= 1000 and be.id < 2000” -> 343 sec
restart
“be.id > 1000 and be.id =< 2000” ->  294 sec

running three parallel queries:
restart
started first: “be.id <1000 ”  -> 278 sec
started second: “be.id >= 1000 and be.id <= 2000”  -> 357 sec
started third: “be.id > 2000”  -> 356 sec

I cannot get a grip of the caching, we have 36 gb memory, Server Main Heap Size set to 24000. When running these three queries I get about 2% used memory. Caching seems not to play a big difference in these cases (if I have not done anything wrong in the setup).

//erik

Profile
 
Posted: 29 October 2010 06:02 AM   Ignore ]   [ # 11 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  964
Joined  2008-08-18

Hi !

To make things more complicated - but more realistic - there are 2 cache mechanisms involved.
1. Operating system reads needed files and stores their content in memory. Subsequent reads targeting same files do not need to get data from disk, they get it from system cache. For some queries this makes big difference - eg. if reading from disk is a bottleneck, so CPUs are working at 50%, with data in system cache disk is no more a bottleneck and CPU can work at 100%

2. Data read from disk by IB is compressed. It gets decompressed and stored in application-level cache. A query needing data already present in this cache does not need to use disk and decompression at all - data is ready to be used.

Both caches are managed independently - the first one by OS, the second one by IB. To clear the first one in Linux either restart system or do
echo 3 > /proc/sys/vm/drop_caches (needs root). To clear the second cache restart IB.

Profile
 
Posted: 29 October 2010 06:08 PM   Ignore ]   [ # 12 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

erik.svanberg - 29 October 2010 03:27 AM

restart
“be.id >= 1000 and be.id <= 2000”  -> 293 sec
restart
“between be.id 1000 and 2000”  -> 476 sec

Could you, please, generate query logs for these two queries? The times are looking quite suspicious.
Manual how to generate the logs:
http://www.infobright.org/wiki/Query_Execution_Log/

Regards,

Signature 
Profile
 
Posted: 01 November 2010 02:18 PM   Ignore ]   [ # 13 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-06-04

Hi,

attached bh.err for those two queries

//erik

File Attachments 
bh.err.txt  (File Size: 10KB - Downloads: 273)
Profile
 
Posted: 02 November 2010 02:43 AM   Ignore ]   [ # 14 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Thanks, Erik, it explains a lot. The faster query was better optimized on syntactical level. We’ll take a look, why.

Regards,

Signature 
Profile
 
Posted: 08 November 2010 11:56 AM   Ignore ]   [ # 15 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Erik,

I discussed with Jakub today. Let me comment further on his behalf.

By:

Jakub Wroblewski - 02 November 2010 02:43 AM

We’ll take a look

he meant that we’ll introduce analogous optimization as soon as our roadmap allows.

Then both versions will run comparably fast.

For now, please use the faster query version.

We hope that it’s not a big problem for you.

Many thanks and best greetings,

Dominik

Signature 
Profile
 
   
1 of 2
1