Joinutility seperatorLogin utility separator Infobright.com
   
 
Can I speed up my join queries?
Posted: 16 December 2008 09:07 PM   Ignore ]  
Newbie
Rank
Total Posts:  4
Joined  2008-12-15

Hello ICE forum members,

I’m experimenting with ICE, using a simple 3-table star schema design (1 fact, 2 dimension tables). No indexes or constraints defined.

I have a script to incrementally load the fact table with half a million records every 30 seconds, and another script to run a battery of test queries, timing the responses, so that I can assess a) query performance as the fact table grows, b) query / load concurrency.
Loads are all done using ‘LOAD DATA INTO’.

All test queries are simple aggregates with joins from fact to both dimensions, and all predicates on the dimensions.

I found that the queries degraded in performance much faster than I had hoped, to the point that some queries took over 2 minutes with only 177Million rows in the fact table. (The time dimension has only 1800 rows, and the other dimension about 100,000 rows)

I played with one of the queries a bit.. I simplified the predicates to the point where I was only predicating on the keys of the two dimension tables - see below.

select sum(f.ifintotaloctets) AS "In" sum(f.ifouttotaloctets) AS "Out"
from fact fdim_config ddim_time t
where t
.timestamp_key between 1227762000 and 1227769140 
and d.ifpid 300021 
and f.ifpid d.ifpid
and f.timestamp_key t.timestamp_key 

This takes about 6 seconds to run.

Modifying the query to predicate instead directly on the fact table foreign key columns and remove the joins, as follows:

select sum(fact5.ifintotaloctets) AS "BytesInShown" sum(fact5.ifouttotaloctets) AS "BytesOutShown"
from fact fact5
where fact5
.timestamp_key between 1227762000 and 1227769140 
and fact5.ifpid 300021 

This one runs near instantaneously.. about 0.01 sec. Huge difference.


Is this expected behavior, or is there anything i can do to speed up the queries with the joins? My instinct would be to cerate indexes on the join keys, but I realize that this is not the InfoBright way.
Do you think the ‘knowledge grid’ could somehow be incorrect? Do I need to run some equivalent of ‘update statistics’ for the knowledge grid? I’m hoping there is some magic that I have missed, that will dramatically speed up my star-schema join queries.

Thanks in advance for your help and suggestions.

Cheers
Bob

PS I’m testing 64-bit RHEL5 on an 8-way/24GB RAM with data & cache on 800GB RAID0 internal array, with brighthouse.ini configured for 24GB per installation guide.

Profile
 
Posted: 17 December 2008 05:46 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hello Bob,

I think your first query should work faster in an ICE version to come soon (an optimization was added recently). Probably the knowledge grid was able to optimize the second query much more efficient, thanks to explicit condition on f.ifpid. In this case a reformulation of the first one should also work…

Anyway, the most direct way of analyzing performance of queries in ICE is to look into server logs (usually in ‘bh.err’ file in the database directory). There is a series of timestamped steps performed by engine for each query phase (starting with something like “Initial execution plan”, up to “Displaying result”). Could you, please, put the logs for these two queries here?

As about parallel load and queries: AFAIK we are not supporting it, i.e. queries are either waiting, or even failing when a load is performed.

Regards,

Signature 
Profile
 
Posted: 17 December 2008 09:12 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  4
Joined  2008-12-15

Hi Jakub, Thank you very much for your reply.

To get any info logged to bh.err, I had to set ‘ControlMessages = 2’ in the brighthouse.ini file… is this correct?
I did that, and here’s what I get..

1. First query, with dimension table joins

select sum(f.ifintotaloctets) AS "In" sum(f.ifouttotaloctets) AS "Out"
from fact fdim_config ddim_time t
where t
.timestamp_key between 1227762000 and 1227769140 
      
and d.ifpid 300021 
      
and f.ifpid =  d.ifpid
      
and f.timestamp_key t.timestamp_key ;

bh.err:
2008-12-17 08:56:24 [3] T:-TABLE_ALIAS(T:0)
T:-TMP_TABLE(T:-1)
T:-TABLE_ALIAS(T:1)
T:-2.JOIN(T:-3)
T:-TABLE_ALIAS(T:2)
T:-2.JOIN(T:-4)
F:CREATE_FILTER(T:-2,TERM(t:-4 a:0),BETWEEN,TERM(vn:1227762000),TERM(vn:1227769140),WHERE)
F:= AND(F:0,TERM(t:-3 a:0),=,TERM(vn:300021),TERM(<null>))
F:= AND(F:0,TERM(t:-1 a:0),=,TERM(vn:300021),TERM(<null>))
F:= AND(F:0,TERM(t:-4 a:0),=,TERM(t:-1 a:1),TERM(<null>))
T:-2.FILTER(F:0)
A:-T:-2.ADD_COLUMN(TERM(t:-1 a:11),SUM,"ifintotaloctets","ALL")
A:-T:-2.ADD_COLUMN(TERM(t:-1 a:12),SUM,"ifouttotaloctets","ALL")
RESULT(T:-2)

2008-12-17 08:56:24 [3] Initial execution plan:
2008-12-17 08:56:24 [3] Cnd(0):   t:-3 a:0 BETvn:300021 AND vn:300021         (0)
2008-12-17 08:56:24 [3] Cnd(1):   t:-1 a:0 BETvn:300021 AND vn:300021         (0)
2008-12-17 08:56:24 [3] Cnd(2):   t:-4 a:0 BETvn:1227762000 AND vn:1227769140         (4.78)
2008-12-17 08:56:24 [3] Cnd(3):   t:-4 a:t:-1 a:1   (38.78)
2008-12-17 08:56:24 [3] Packs/packrows after KN evaluation:
2008-12-17 08:56:24 [3] (t0Pckrows2705susp2705 (empty 0 full). Packs opened in 1 cond.: 2705
2008
-12-17 08:56:24 [3] (t1Pckrows2susp(empty 0 full). Packs opened in 1 cond.: 1
2008
-12-17 08:56:24 [3] (t2Pckrows1susp(empty 0 full). Packs opened in 1 cond.: 1
2008
-12-17 08:56:24 [3] (t0-t2Pack2Pack pairs100possible.
2008-12-17 08:57:00 [3] Packrows after exact evaluation (WHERE):
2008-12-17 08:57:00 [3] (t0): 2705 all packrows1805 to open (including 0 full)
2008-12-17 08:57:00 [3] (t1): 2 all packrows1 to open (including 0 full)
2008-12-17 08:57:00 [3] (t2): 1 all packrows1 to open (including 0 full)
2008-12-17 08:57:00 [3] Traversed all 120 rows.
2008-12-17 08:57:03 [3] Produced 120 tuples.
2008-12-17 08:57:03 [3] Tuples after inner join 2-0 [hash]120
2008
-12-17 08:57:03 [3] Displaying result1 rows.
2008-12-17 08:57:03 [3] Total data packs actually loaded (approx.): 3430
2008
-12-17 08:57:03 [3] ---------------------------------------------------------------------------- 

NOTE: It seemed to take even longer this morning.. almost 40 seconds..  maybe it had to reload cache?

2. Second query (with no joins)

select sum(fact5.ifintotaloctets) AS "BytesInShown" sum(fact5.ifouttotaloctets) AS "BytesOutShown"
from fact fact5
where fact5
.timestamp_key between 1227762000 and 1227769140 
      
and fact5.ifpid 300021 ;

bh.err
2008
-12-17 09:00:27 [4] T:-TABLE_ALIAS(T:0)
T:-TMP_TABLE(T:-1)
F:CREATE_FILTER(T:-2,TERM(t:-1 a:1),BETWEEN,TERM(vn:1227762000),TERM(vn:1227769140),WHERE)
F:= AND(F:0,TERM(t:-1 a:0),=,TERM(vn:300021),TERM(<null>))
T:-2.FILTER(F:0)
A:-T:-2.ADD_COLUMN(TERM(t:-1 a:11),SUM,"ifintotaloctets","ALL")
A:-T:-2.ADD_COLUMN(TERM(t:-1 a:12),SUM,"ifouttotaloctets","ALL")
RESULT(T:-2)

2008-12-17 09:00:27 [4] Initial execution plan:
2008-12-17 09:00:27 [4] Cnd(0):   t:-1 a:0 BETvn:300021 AND vn:300021         (0)
2008-12-17 09:00:27 [4] Cnd(1):   t:-1 a:1 BETvn:1227762000 AND vn:1227769140         (16.27)
2008-12-17 09:00:27 [4] Packs/packrows after KN evaluation:
2008-12-17 09:00:27 [4] (t0Pckrows2705susp180 (2525 empty 0 full). Packs opened in 2 cond.: 181
2008
-12-17 09:00:28 [4] Packrows after exact evaluation (WHERE):
2008-12-17 09:00:28 [4] (t0): 2705 all packrows120 to open (including 0 full)
2008-12-17 09:00:28 [4] Displaying result1 rows.
2008-12-17 09:00:28 [4] Total data packs actually loaded (approx.): 0
2008
-12-17 09:00:28 [4] ---------------------------------------------------------------------------- 

Very fast.. < 1 sec

Since the first query had taken much longer than it had yesterday (~40sec compared to ~6sec) I ran it again.. 

2008-12-17 09:01:29 [5] Initial execution plan:
2008-12-17 09:01:29 [5] Cnd(0):   t:-3 a:0 BETvn:300021 AND vn:300021         (0)
2008-12-17 09:01:29 [5] Cnd(1):   t:-1 a:0 BETvn:300021 AND vn:300021         (0)
2008-12-17 09:01:29 [5] Cnd(2):   t:-4 a:0 BETvn:1227762000 AND vn:1227769140         (4.78)
2008-12-17 09:01:29 [5] Cnd(3):   t:-4 a:t:-1 a:1   (38.78)
2008-12-17 09:01:29 [5] Packs/packrows after KN evaluation:
2008-12-17 09:01:29 [5] (t0Pckrows2705susp2705 (empty 0 full). Packs opened in 1 cond.: 2705
2008
-12-17 09:01:29 [5] (t1Pckrows2susp(empty 0 full). Packs opened in 1 cond.: 1
2008
-12-17 09:01:29 [5] (t2Pckrows1susp(empty 0 full). Packs opened in 1 cond.: 1
2008
-12-17 09:01:29 [5] (t0-t2Pack2Pack pairs100possible.
2008-12-17 09:01:35 [5] Packrows after exact evaluation (WHERE):
2008-12-17 09:01:35 [5] (t0): 2705 all packrows1805 to open (including 0 full)
2008-12-17 09:01:35 [5] (t1): 2 all packrows1 to open (including 0 full)
2008-12-17 09:01:35 [5] (t2): 1 all packrows1 to open (including 0 full)
2008-12-17 09:01:35 [5] Traversed all 120 rows.
2008-12-17 09:01:35 [5] Produced 120 tuples.
2008-12-17 09:01:35 [5] Tuples after inner join 2-0 [hash]120
2008
-12-17 09:01:35 [5] Displaying result1 rows.
2008-12-17 09:01:35 [5] Total data packs actually loaded (approx.): 0
2008
-12-17 09:01:35 [5] ---------------------------------------------------------------------------- 

Back to around 6 seconds, presumably working from memory cache.

Does this provide clues for speeding up the join?
Cheers, Bob

Profile
 
Posted: 17 December 2008 11:50 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18
bostr - 17 December 2008 09:12 AM

Does this provide clues for speeding up the join?

Yes, thanks a lot. It’s exactly the part of code optimized recently, so you will see better performance in January release of ICE (I guess).
The reason of difference between these two cases is hidden in this line:

2008-12-17 09:00:27 [4] (t0Pckrows2705susp180 (2525 empty 0 full). Packs opened in 2 cond.: 181 

The one-table query got a huge benefit from knowledge nodes here: there is only 180 “suspected” data packs (i.e. data packs needed to be actually loaded/decompressed). The similar line in the slower case show that all 2705 packs are suspected, because ICE haven’t translate the dimension conditions into the fact table.

BTW, I am curious what time for these 177 mln rows may be regarded as satisfactory? (Do you have any comparison with other databases for the same data/queries?)

Regards,

[ Edited: 17 December 2008 11:54 AM by Jakub Wroblewski]
Signature 
Profile
 
Posted: 17 December 2008 12:38 PM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  4
Joined  2008-12-15

Hello Jakub

BTW, I am curious what time for these 177 mln rows may be regarded as satisfactory? (Do you have any comparison with other databases for the same data/queries?)

This particular query should run very fast, I think, since it is constrained to a narrow set of records from the fact table (1 identified object for only a narrow range of time keys) -  I would expect it to return consistently in well under a second. 
Also I would hope to see that query time does not increase significantly as the number of records in the fact table grows to many hundreds of millions / few billions.

Anyway, thank you for your help. I will look out for the January ICE release.

On the other question of concurrent query / load -  do you know if this is planned to be suppoted at some point soon?

Regards
Bob

Profile
 
Posted: 18 December 2008 03:49 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  651
Joined  2008-08-18

Hi Bob!!

I would expect it to return consistently in well under a second.

Is this expectation based on some previous experience ? I’m just curious if you are trying to catch up with the performance you had so far, or ICE is the first to deal with your data. You know, one thing is “fast for me”, another “faster than others” ... I guess you had used engines supporting indexing.

concurrent query / load - do you know if this is planned to be supported at some point soon

It is planned, but I cannot call it “soon”, i.e. within next month…
Regards,

Profile
 
Posted: 18 December 2008 10:07 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  4
Joined  2008-12-15

Hi

Is this expectation based on some previous experience ? I’m just curious if you are trying to catch up with the performance you had so far, or ICE is the first to deal with your data. You know, one thing is “fast for me”, another “faster than others” ... I guess you had used engines supporting indexing.

Well, in this case, ICE itself contributed to this expectation by giving sub-second response time for the 1-table query. The 3-table query was semantically equivalent yet took very significantly longer. I’m no expert in SQL query optimization, but it strikes me that this simple test indicates that the join operation is far too costly. In a more optimal scenario, the dimension table predicates would be used to confine the scope of the fact table retrieval, and avoid all those unnecessary fact table ‘data packs’ from being loaded/decompressed (a behaviour that will cause the query to degrade increasingly as the size of the fact table grows, despite the query itself only needing to access a fixed subset of records)

Also, I’m seeing very consistent sub-second reponses for this 3-table query on a commercial column store database I’ve been evaluating, though so far I have not managed to get similar performance from any opensource column store alternatives.

Do you think the fix, previously mentioned by Jakub on this thread, could theoretically bring performance of the 3-table query in line with the equivalent 1-table query?  I imagine that there will always be some overhead associated with runtime joins, but if the fix allows dimension predicates to properly limit the scope of the fact table data access, then the join query should be *much* faster, and (mostly) independent of the number of rows in the fact table? Right?

Cheers
Bob

Profile
 
Posted: 18 December 2008 10:14 AM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  651
Joined  2008-08-18

Hi !

We have seen some examples of queries, which when analyzed on paper seemed easy to be answered. Still some engines could not take them, apparently missing the necessary analysis. In this case Jakub’s fix does the required job. It recognizes, that the condition can be applied to the fact table only, not to the result of the join.

Thanks for your posts - give us a notice when you will be able to try the updated ICE version.

J.

Profile