Joinutility seperatorLogin utility separator Infobright.com
   
 
SQL Join between 2 tables hanging?
Posted: 12 December 2009 03:42 PM   Ignore ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Hi,

I loaded 2 tables, t1 and t2.

t1 has about 75MM rows with 6 INT columns and t2 has about 325MM rows with 150+ columns but most of them except 2 are ‘lookup’ columns.

Queries individually on either of these tables work fine. But now I am trying to join these 2 tables based on one column and the query seems to be hanging. I have waited for about couple of hours (twice) and killed the process.

Here is the query: SELECT * FROM t1 a JOIN t2 b ON (a.id = b.id) LIMIT 1

Any ideas?

-Venks

Profile
 
Posted: 12 December 2009 07:17 PM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

ICE does not optimize LIMIT 1 operation, so your query performs the whole join and then displays the first row of a multi-GB result. Please try to limit tables by WHERE, or use COUNT(*), depending on what do you actually want to obtain from your data.

Regards,

Signature 
Profile
 
Posted: 12 December 2009 07:53 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Hi,

Thanks for your quick reply. I actually tried the following query first, but killed the session after an hour or so.

SELECT COUNT(*) FROM t1 a JOIN t2 b ON (a.id = b.id) ;

When the above did not work, I checked the documention and I found the following in the user guide.


==================================================================

Creating Pack/Pack Knowledge Nodes

Pack/Pack Knowledge Nodes are automatically created by most of joining queries. However,
if you anticipate that two tables will commonly be used together in queries, you can improve
query performance by manually creating a Pack/Pack Knowledge Node for a pair of
columns from the two tables.
- To create a Pack/Pack Knowledge Node, execute a joining query without any additional constraints.

For example:

Select * from table1 join table2 on a1=a2 limit 1;

Important: Avoid use of “Select count(*) from table1 join table2 on a1=a2” since various
optimizations will bypass Knowledge Node creation.

=================================================================

Based on the above information, I tried the LIMIT1 query.


Please let me know if you need additional info.

-Venks

Profile
 
Posted: 13 December 2009 03:25 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

venks - 12 December 2009 07:53 PM

- To create a Pack/Pack Knowledge Node, execute a joining query without any additional constraints.

For example:

Select * from table1 join table2 on a1=a2 limit 1;

I see. Well, joining tables of this size may take some time, but it’s better to check whether there is no error there. Would you, please, provide us with your bh.err log file for this query? Instructions how to create it:

http://www.infobright.org/wiki/Query_Execution_Log/

BTW, in current versions of ICE (3.2 and newer) there is no need to precompute Pack2Pack nodes.

Regards,

Signature 
Profile
 
Posted: 14 December 2009 10:21 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Jakub,

Last night I fired the query and after 12 hours it has done nothing. So I think the query is hanging. Please find attached the bh.err file.

FYI, EVENTS_X is “t1” and INDIVIDUAL_DEMOGRAPHICS is “t2” in my sample query. I didn’t use the actual tables in the query I mentioned in this thread earlier.

-Venks

File Attachments 
bh.err.txt  (File Size: 3KB - Downloads: 288)
Profile
 
Posted: 14 December 2009 10:36 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Thanks. The log looks suspicious: joining didn’t even started… It may be a bug. Could you, please, do some more tests (with logging)? E.g. limit both tables (WHERE) to just a small subset of rows, and then do the join?
Two other questions:
- what hardware are you using (RAM etc.)?
- what are types of columns used in join?

Regards,

Signature 
Profile
 
Posted: 14 December 2009 11:40 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Jakub,

If I include another WHERE condition the query works fine. I tried with 3 different WHERE conditions on 3 different columns and all of them came back fine. So I think it is a bug and for the testing I am doing, I need to do simple JOIN on few such tables.

I installed ICE 3.2.2 on Cent OS 64 bit with 2 CPU (Quad Core) and 12GB of RAM.

Let me know if you need any additional information.

-Venks

Profile
 
Posted: 14 December 2009 12:12 PM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

Bh.err file for a successful run may be helpful.
For 12 GB RAM, your brighthouse.ini settings are a bit risky (swapping may easily occur). Please try ServerMainHeapSize between 6000 and 8000.

Regards,

Signature 
Profile
 
Posted: 14 December 2009 12:59 PM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Please find attached a “bh.err” file from a successful run on few queries with a WHERE condition.

Also, I updated the heap size as per your suggestion and I don’t think it helped the issue at hand.

-Venks

File Attachments 
bh.err.2.txt  (File Size: 10KB - Downloads: 316)
Profile
 
Posted: 14 December 2009 01:59 PM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

By extrapolation, the COUNT(*) join without WHERE should take 10-20 minutes. Apparently it is not the case, so probably we have a bug here. Just in case - please try COUNT(*) query and check whether it is anything nontrivial in bh.err file after, say, 30 minutes. (What I mean by “nontrivial”: e.g. messages like “Traversed 40265317/62368113 rows” may give us a clue about total time).

Regards,

Signature 
Profile
 
Posted: 14 December 2009 04:10 PM   Ignore ]   [ # 10 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Hi,

I tried running the query without WHERE condition and I don’t think ICE is doing anything. I don’t see any memory/swap/disk activity at all. Following is the last few lines from “bh.err” after running the query for about an 1h and 10m. Please let me know how to proceed from here.

================================================================================
2009-12-14 14:25:33 [1] T:-1 = TABLE_ALIAS(T:0,“events_X”)
T:-2 = TMP_TABLE(T:-1)
T:-3 = TABLE_ALIAS(T:1,“individual_demographics”)
T:-2.JOIN(T:-3)
A:-1 = T:-2.ADD_COLUMN(<null>,COUNT,“count(*)”,“ALL”)
VC:-2.0 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:4))
VC:-2.1 = CREATE_VC(T:-2,PHYS_COL(T:-3,A:1))
F:0 = CREATE_FILTER(T:-2,VC:-2.0,=,VC:-2.1,<null>,WHERE)
T:-2.FILTER(F:0)
RESULT(T:-2)

2009-12-14 14:25:33 [1] Initial execution plan (non-join):
2009-12-14 14:25:33 [1] Packs/packrows after KN evaluation:
2009-12-14 14:25:33 [1] (t0) Pckrows: 1177, susp. 1177 (0 empty 0 full). Packs opened in 0 cond.: 0
2009-12-14 14:25:33 [1] (t1) Pckrows: 4968, susp. 4968 (0 empty 0 full). Packs opened in 0 cond.: 0
2009-12-14 14:25:33 [1] Packrows after exact evaluation (WHERE):
2009-12-14 14:25:33 [1] (t0): 1177 all packrows, 1177 to open (including 1177 full)
2009-12-14 14:25:33 [1] (t1): 4968 all packrows, 4968 to open (including 4968 full)
2009-12-14 14:25:33 [1] Join execution plan:
2009-12-14 14:25:33 [1] Cnd(0):  VC:0(t0a4) = VC:1(t1a1)    (33.72)
================================================================================

Profile
 
Posted: 16 December 2009 06:26 AM   Ignore ]   [ # 11 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi Venks,

Is it any possibility that we could get your data for tests? Looks like a bug.

We actually would need just “id” columns, so the data transferred should have reasonable size, and no sensitive data would be involved.

Regards,

[ Edited: 16 December 2009 06:31 AM by Jakub Wroblewski]
Signature 
Profile
 
Posted: 16 December 2009 10:32 AM   Ignore ]   [ # 12 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-12-07

Sure. I think I can send the ID data. But before I do that please check out the following:

All the following queries do not work. I tried to trick the optimizer to include a WHERE condition but in vain.

SELECT COUNT(*) FROM t1 a JOIN t2 b ON (a.id2 = b.id2) ;
SELECT COUNT(id0) FROM t1 a JOIN t2 b ON (a.id2 = b.id2) ;
SELECT COUNT(*) FROM t1 a JOIN t2 b ON (a.id2 = b.id2) WHERE 1 = 1 ;
SELECT COUNT(*) FROM t1 a JOIN t2 b ON (a.id2 = b.id2) WHERE COALESCE(b.col001,‘x’) IS NOT NULL ;


Following query would work fine (Includes a real WHERE condition)

SELECT COUNT(*) FROM t1 a JOIN t2 b ON (a.id2 = b.id2) WHERE b.colXXX = ‘<some value>’ ;

Please find attached the table definition of the 2 tables with simple column names for your reference. I am wondering whether you can generate the data yourself by creating an incremental sequence for the Primary Keys and some random integer data for “id” columns. Let me know your thoughts and how to proceed from here.

File Attachments 
tables_ib3.txt  (File Size: 16KB - Downloads: 314)
Profile
 
Posted: 16 December 2009 11:02 AM   Ignore ]   [ # 13 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Thanks a lot, we can identify the problem based on your descritpion.

Regards,

Signature 
Profile
 
Posted: 16 December 2009 11:29 AM   Ignore ]   [ # 14 ]  
Member
RankRankRank
Total Posts:  106
Joined  2008-08-18

Hi venks,

I create a defect in our bugs report system:

http://bugs.infobright.org/ticket/1774

Thanks

Profile