Joinutility seperatorLogin utility separator Infobright.com
   
 
Using AND in LEFT JOINs
Posted: 08 April 2009 05:11 PM   Ignore ]  
Newbie
Rank
Total Posts:  8
Joined  2009-03-20

I’m trying to grab some trend data by hours of day.  I have a fact table that has a reference to a TimeDimension.  The TimeDimension has one entry for every minute of the day, plus an entry for “unspecified” (1441 entries in all).  The TimeDimension looks like this:

CREATE TABLE `TimeDimension` (
  `timeID` int(11) NOT NULL,
  `hour` tinyint(4) NOT NULL,
  `minuteOfHour` tinyint(4) NOT NULL,
  `minuteOfDay` smallint(6) NOT NULL,
  `timestamp` time NOT NULL,
  `halfDayOfDay` tinyint(4) NOT NULL,
  `hourOfHalfDay` tinyint(4) NOT NULL,
  `halfOfHour` tinyint(4) NOT NULL,
  `quarterOfHour` tinyint(4) NOT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1;

The fact table also has a column for organization, so I can constrain queries to organizations.  I want to get the number of facts for each hour of the day for a particular organization, so I have a query:

SELECT t.hour, count(c.factID)
FROM TimeDimension t
LEFT JOIN Facts c
ON c.initiatedTime=t.timeID
AND c.org=1234
GROUP BY t.hour
ORDER BY t.hour;

This returns:
ERROR 5 (HY000): The query includes syntax that is not supported by the Infobright Optimizer. Either restructure the query with supported syntax, or enable the MySQL Query Path in the brighthouse.ini file to execute the query with reduced performance.

I have an identical structure implemented using MyISAM for performance comparisons and I get the following from that:
mysql> select t.hour, count(c.factID)
from TimeDimension t
LEFT JOIN Facts c on c.initiatedTime=t.timeID
and c.org=1234
group by t.hour
order by t.hour;
+———+—————————-+
| hour | count(c.factID) |
+———+—————————-+
|  -1 |            0 |
|  0 |            2 |
|  1 |            0 |
|  2 |            2 |
|  3 |            0 |
|  4 |            0 |
|  5 |            0 |
|  6 |            0 |
|  7 |            0 |
|  8 |            1 |
|  9 |            0 |
|  10 |            1 |
|  11 |            7 |
|  12 |            3 |
|  13 |            6 |
|  14 |            2 |
|  15 |          10 |
|  16 |          25 |
|  17 |            8 |
|  18 |            3 |
|  19 |            1 |
|  20 |            2 |
|  21 |            1 |
|  22 |            3 |
|  23 |            0 |
+———+—————————-+
25 rows in set (0.00 sec)

If I move the c.org=1234 into the WHERE clause then I don’t get the hours of the day with a count() of 0, but the query works on both Brighthouse and MyISAM:

mysql> select t.hour, count(c.factID)
from TimeDimension t
LEFT JOIN Facts c on c.initiatedTime=t.timeID
where c.uoid=1234
group by t.hour
order by t.hour;
+———+—————————-+
| hour | count(c.FTCallID) |
+———+—————————-+
|  0 |            2 |
|  2 |            2 |
|  8 |            1 |
|  10 |            1 |
|  11 |            7 |
|  12 |            3 |
|  13 |            6 |
|  14 |            2 |
|  15 |          10 |
|  16 |          25 |
|  17 |            8 |
|  18 |            3 |
|  19 |            1 |
|  20 |            2 |
|  21 |            1 |
|  22 |            3 |
+———+—————————-+
16 rows in set (0.00 sec)


I need to return one row per hour for the graphs to work, otherwise I end up with missing values on the X-axis of the graph.  Any ideas how to get this data from Brighthouse?

Profile
 
Posted: 08 April 2009 05:53 PM   Ignore ]   [ # 1 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-03-20

I just checked this on Oracle too ... I have the same schema there.  It supports the AND in the LEFT JOIN and returns the columns correctly.  Moving the c.org to the WHERE clause doesn’t return the hours with count() = 0.

Looks like this is simply a limitation of the Brighthouse query optimizer.

Profile
 
Posted: 08 April 2009 06:17 PM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello barrycoleman,

I’m sure you’ll receive far better answer by tomorrow. For now, let me try with the following rewrite. Please let me know whether it helps. In the meantime, I certainly agree that we need to take a closer look at this limitation in ICE.

SELECT h.hourx.count
FROM
   
(SELECT DISTINCT hour FROM TimeDimensionh
LEFT JOIN 
   
(SELECT t.hour as hourcount(c.factID) as count
    FROM TimeDimension t
    INNER JOIN Facts c
    ON c
.initiatedTime=t.timeID
    WHERE c
.org=1234
    GROUP BY t
.hourx
ON x
.hour=h.hour
ORDER BY h
.hour

I know it looks awful. I hope I didn’t make any mistake…

Best greetings,

Dominik

Signature 
Profile
 
Posted: 08 April 2009 06:38 PM   Ignore ]   [ # 3 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-03-20

Thanks Dominik.  That works perfectly.  I agree it’s a little ugly, but I was working towards the same thing.  Of course, the count column returns NULL instead of 0 for hours without an entry.  I modified your query to account for that with:

SELECT h.hour, IF(x.count IS NULL,0,x.count) as count
FROM 
  
(SELECT DISTINCT hour FROM TimeDimensionh
LEFT JOIN
  
(SELECT t.hour as hourcount(c.factID) as count
   FROM TimeDimension t
   INNER JOIN Facts c
   ON c
.initiatedTime=t.timeID
   WHERE c
.org=1234
   GROUP BY t
.hourx
ON x
.hour=h.hour
ORDER BY h
.hour

Brighthouse is about twice as fast as MyISAM with this query on a 5.6M row Facts table (0.2secs vs. 0.4secs).  Interestingly, Oracle takes 26.4 seconds for this query.

[ Edited: 08 April 2009 06:51 PM by barrycoleman]
Profile
 
Posted: 09 April 2009 03:56 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-03-20

Found out why Oracle was taking 26 seconds.  In MySQL a KEY is an INDEX.  In Oracle it isn’t.  So even though a column was a foreign key constraint it wasn’t indexed.  Once I created a key for the column it was only 3 times slower than Brighthouse.

Profile
 
Posted: 09 April 2009 02:30 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello barrycoleman,

I’m glad that it worked out! We’ll be always happy to hear from you.

Actually, I’d be quite interested in how ICE/MySQL/Oracle speed evolves when data grows.

It would be nice to look at the data volumes and the numbers of distinct organizations in the data…

Whenever you have a minute, would you mind letting us know whether you considered such scenarios?

Best greetings,

Dominik

Signature 
Profile
 
Posted: 09 April 2009 03:24 PM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-03-20

Yeah, we have 778 organizations with the total number of entries in this fact table per organization ranging from 1.77 million (1/4 of the dataset) to 1.  Only 13 of 778 organizations have more than 65,000 facts.  It’s the usual distribution 20% of the organizations account for 80% of the data.

I’ve pulled only a small fraction of our data over (3 months worth for one feature). I’m planning to grab some more to see how performance is impacted by data growth.  I have 5.6million rows right now and plan to add another 15 million or so over the next few days.

I’ll let you know how it goes.

Profile
 
Posted: 09 April 2009 03:28 PM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18
barrycoleman - 09 April 2009 03:24 PM

I’ll let you know how it goes.

Thanks, I’ll be looking forward. It’s really interesting case.

Best greetings,

Dominik

Signature 
Profile