Joinutility seperatorLogin utility separator Infobright.com
   
 
optimizing join
Posted: 03 November 2009 07:49 PM   Ignore ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Hi,

For an analysis I have to perform a following join mentioned below -

SELECT a.AffiliateName, a.CustomerID, a.CustomerKey
from ActiveCustomerInfo a, ActiveCustomerInfoDup b
WHERE (a.CustomerKey != b.CustomerKey)
AND (a.EmailAddressKey = b.EmailAddressKey)
AND ((a.firstname != b.firstname AND a.lastname != b.lastname)
    OR (a.firstname != b.firstname AND a.lastname = b.lastname))
GROUP BY a.AffiliateName, a.CustomerID, a.CustomerKey

The table ActiveCustomerInfo and ActiveCustomerInfoDup are identical.
I could also do a self join if it would be faster.
The row count in the table is approximatley 37 million.

The table structure is -
CREATE TABLE `ActiveCustomerInfo` (
  `CustomerID` varchar(64) DEFAULT NULL,
  `CustomerKey` int(11) NOT NULL,
  `FirstName` varchar(255) DEFAULT NULL,
  `LastName` varchar(255) DEFAULT NULL,
  `EmailAddressKey` int(11) DEFAULT NULL,
  `AffiliateName` varchar(255) DEFAULT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1;

I could reduce the varchar length for firstName, lastName.
Currently it is taking a lot of time. At the time of writing this request it had already been well over 4 hours.

Please help me fine tune the sql.

Thanks a lot in advance,
Som

Profile
 
Posted: 03 November 2009 08:02 PM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  454
Joined  2008-08-18

Why not dedup nightly to ensure that you don’t introduce duplicate records.

Geoffrey

Signature 
Profile
 
Posted: 03 November 2009 08:15 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Thanks Geoffrey for the reply.
There are approximately 37 million record.
It is taking too long to execute. I think I am missing something and could make it run a lot faster.
It is one time request and hence we dont have a nightly process for it.

Thanks,
Somesh

Profile
 
Posted: 03 November 2009 08:25 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  454
Joined  2008-08-18

If the lastnames are never null, then this condition

((a.firstname != b.firstname AND a.lastname != b.lastname)
    OR (
a.firstname != b.firstname AND a.lastname b.lastname)) 

is equivalent to

a.firstname != b.firstname 

so try replacing this condition and see if performance improves.

[ Edited: 04 November 2009 10:12 AM by Geoffrey Falk]
Signature 
Profile
 
Posted: 04 November 2009 04:23 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Thanks for the reply.
Last name could be null.
Although I did reduce the number of records by selecting records which have same email address.
That reduced the row count from 37 million to 12 million but still the join does not happen. I am fearing I would get error Brighthouse out of resources as it could create a really big temp table.
Please continue sharing your thoughts.

Thanks again,
Som

Profile
 
Posted: 04 November 2009 04:28 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  651
Joined  2008-08-18

Hi !

CustomerID` varchar(64) - bigint would work must faster.
Anyway a join condition using “not equal” is costly.

And as Geoffrey mentioned, “a.firstname != b.firstname” seems to be equivalent to the whole “((a.firstname != b.firstname AND a.lastname != b.lastname) OR (a.firstname != b.firstname AND a.lastname = b.lastname))  ” condition.

Profile
 
Posted: 04 November 2009 05:21 AM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi Som,

Please try to get rid with OR, as it has always a performance impact in joins. Also, it would be helpful if you can show an Infobright query log for this query - it is located in bh.err file in your database directory, and it needs ControlMessages = 2 to be set (and uncommented) in brighthouse.ini configuration file.

Regards,

Signature 
Profile
 
Posted: 04 November 2009 10:16 AM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  454
Joined  2008-08-18

[deleted, please ignore]

[ Edited: 04 November 2009 10:29 AM by Geoffrey Falk]
Signature 
Profile
 
Posted: 04 November 2009 07:27 PM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Thanks for the reply.
I had taken out OR condition.
The sql now is SELECT a.AffiliateName, a.CustomerID, a.CustomerKey
FROM ActiveCustomerInfoFeed a, ActiveCustomerInfoFeed b
WHERE (a.EmailAddressKey = b.EmailAddressKey)
AND (a.CustomerKey != b.CustomerKey)
AND (a.firstname != b.firstname)
GROUP BY a.AffiliateName, a.CustomerID, a.CustomerKey
The table contains close to 13 million record.
It has become critical for us so I also opened a ticket. The number is 527765.
I apolozise if I am doing a cross posting.
Also please find the bh.err file in the attachment.

Thanks,
Som

File Attachments 
bh.zip  (File Size: 4KB - Downloads: 108)
Profile
 
Posted: 05 November 2009 01:43 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Thanks Som,

Unfortunately, the bh.err still does not contain an execution log for this query, because ControlMessages are set to 0. Make sure that this line is not commented in brighthouse.ini. Server restart is also needed.

However, what we can see in the logs is that the query is just crashing server.

Regards,

Signature 
Profile
 
Posted: 05 November 2009 01:52 AM   Ignore ]   [ # 10 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Thanks for the reply.
I appreciate your help.
I have further reduced the sql to - SELECT a.AffiliateName, a.CustomerID, a.CustomerKey
FROM ActiveCustomerInfoFeed a, ActiveCustomerInfoFeed b
WHERE (a.EmailAddressKey = b.EmailAddressKey)
AND (a.CustomerKey != b.CustomerKey)
AND (a.firstname != b.firstname)
GROUP BY a.AffiliateName, a.CustomerID, a.CustomerKey


Also trying to find tune the structure as much as possible. The current row count is 13 million. But still is taking too long. I have yet to see the results and usually it errors out after couple of hours or the interface just hangs.
Currently ETL is running so it is not possible for me to change the setting and restart the service.

Btw, do you think the sql is optimally written for Infobright. Could I try something else to get it going.
Thanks again,
Som

[ Edited: 05 November 2009 02:03 AM by som]
Profile
 
Posted: 05 November 2009 02:44 AM   Ignore ]   [ # 11 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi Som,

I don’t know if this query would be better formulated. If it is crashing, the long response time may be connected with automatic restarting of server or sth. similar.

Just as a quick try: as a GROUP BY without aggregations, it is equivalent to DISTINCT. Could you, please, try this (with logging enabled, if possible):

SELECT DISTINCT a.AffiliateNamea.CustomerIDa.CustomerKey 
FROM ActiveCustomerInfoFeed a
ActiveCustomerInfoFeed b 
WHERE 
(a.EmailAddressKey b.EmailAddressKey
AND (
a.CustomerKey != b.CustomerKey
AND (
a.firstname != b.firstname); 

And a bit changed, just to compare:

SELECT DISTINCT a.AffiliateNamea.CustomerIDb.CustomerKey 
FROM ActiveCustomerInfoFeed a
ActiveCustomerInfoFeed b 
WHERE 
(a.EmailAddressKey b.EmailAddressKey
AND (
a.CustomerKey != b.CustomerKey
AND (
a.firstname != b.firstname); 

Regards,

Signature 
Profile
 
Posted: 05 November 2009 10:11 AM   Ignore ]   [ # 12 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  454
Joined  2008-08-18

Hi,

It appears that you are trying to identify customers who have signed up more than once using the same email address. Am I correct?

You should see better performance if the data is sorted by EmailAddressKey. Maybe dump the data into a temporary table, sorted and then run the query against the sorted table.

Geoffrey

Signature 
Profile
 
Posted: 05 November 2009 03:26 PM   Ignore ]   [ # 13 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-07-10

Thanks Jakub and Geoffery.
I am trying the new sql.
Geoffery - I did sort the records based on emailaddress.
I will enable the logs as mentioned before as soon as I can restart the service which is not possible now as continuos load is going on.


Thanks for your help,
Somesh

Profile
 
   
 
 
‹‹ Install failed...      api supported ››