Joinutility seperatorLogin utility separator Infobright.com
   
 
Really slow query.
Posted: 15 July 2010 11:07 AM   Ignore ]  
Newbie
Rank
Total Posts:  6
Joined  2010-02-02

I am trying to execute a query against a fact table of approximately 13 million rows.  The query joins in one of its dimensions and a third table that will be used to map that dimension from a slowly changing type 2 to a type 1 dimension.  Additionally, I am removing a degenerate dimension. 

I know it sounds complicated, but the query is pretty straight forward.  I originally was selecting into an outfile and let the query run for 12 hours before killing it.  I then attempted to select the first 100 rows using limit 100, and it ran for an hour before I killed it.

All fields in the fact table are integers, except the degenerate dimension that is being removed.  The fact table has roughly 13 million rows, the dimension has 250,000 rows, and the mapping table has 3200 rows.  Any recommendations on how to make this query execute would be appreciated.

SELECT `date_dimension_id`
    , `specialty_dimension_id`
    , `geo_dimension_id`
    , `max_ad_dimension_id`
    , `customer_dimension_id`
    , `page_dimension_id`
    , `domain_dimension_id`
    , `channel_dimension_id`
    , sum(`impression_count`)
    , sum(`click_count`)
  FROM `ad_performance`
  JOIN `ad_dimension` on `ad_dimension`.`id` = `ad_performance`.`ad_dimension_id`
  JOIN `ad_dim_map` on `ad_dim_map`.`ad_id` = `ad_dimension`.`ad_id`
GROUP BY `date_dimension_id`
    , `specialty_dimension_id`
    , `geo_dimension_id`
    , `ad_dimension_id`
    , `customer_dimension_id`
    , `page_dimension_id`
    , `domain_dimension_id`
    , `channel_dimension_id`
ORDER BY `date_dimension_id`
    , `specialty_dimension_id`
    , `geo_dimension_id`
    , `ad_dimension_id`
    , `customer_dimension_id`
    , `page_dimension_id`
    , `domain_dimension_id`
    , `channel_dimension_id`;

Profile
 
Posted: 16 July 2010 04:32 AM   Ignore ]   [ # 1 ]  
Member
Avatar
RankRankRank
Total Posts:  191
Joined  2008-08-18

Hi Charlie,

Sorry for trivial question, but are you sure all the tables were created for Brighthouse engine? You can easily check it by

show create table ad_performance

At the end you should see ‘engine=brighthouse’. Check every table.

Thanks,
Piotr

Profile
 
Posted: 16 July 2010 10:20 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-02-02

I am sure that all tables are NOT brighthouse.  The 3200 row mapping table ad_dim_map is MyISAM.  The fact table and dimension are both brighthouse.  Is it a requirement that all tables be brighthouse?

Profile
 
Posted: 16 July 2010 10:31 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  505
Joined  2008-08-18

If you want reasonable performance with JOIN, then all tables in the JOIN must be of the same engine type. You cannot reasonably expect to join a Brighthouse to a MyISAM.

Geoffrey

Signature 
Profile
 
Posted: 16 July 2010 10:32 AM   Ignore ]   [ # 4 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  738
Joined  2008-08-18

Any SQL command that contains any non-BRIGHTHOUSE tables will be executed by the MySQL optimizer and execution path.

Signature 
Profile
 
Posted: 16 July 2010 12:20 PM   Ignore ]   [ # 5 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-02-02

I have made all tables involved brighthouse.  I launched the query again and it has been running for 1.5 hours so far and no results.  Any other recommendations on how to speed this up?

Profile
 
Posted: 16 July 2010 12:37 PM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  738
Joined  2008-08-18

charlie,

What appears to me first is that there are no filter conditions on this query.  So, in effect, it is doing a full table scan of the fact table, assuming their are no orphaned records.

Is this typical of your environment or just a test?  If you were to apply a filter - especially on a column in the fact table - the Knowledge Grid could be utilized to reduce the overall I/O for the query.  No filters roughly equates to no leveraging of the Infobright metadata in the Knowledge Grid.  Are there any appropriate filters for your query under normal circumstances?

Second, clearly the GROUP BY and ORDER BY take a lot of work.  The entire query result must be decompressed and materialize in temp space, implicitly sorted for the GROUP BY on eight columns (necessary for the aggregations), and then explicitly sorted for the ORDER BY on the same eight columns.  Is the ORDER BY statement necessary?

You can verify that all of this is happening by monitoring your file system, especially the directory and file system used for the CacheFolder (defined in brighthouse.ini) and perhaps OS swap space, and running utilities like top, iostat and vmstat.

Signature 
Profile
 
Posted: 16 July 2010 01:00 PM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-02-02

The purpose of this query is to migrate data in a fact table.  The ad_dimension, referenced by the ad_performance fact,  was a slowly changing type 2 table.  It was determined that we don’t need that granularity of data and that it would be better to be a type 1.  In order to make this migration, I need to find the largest ad_dimension.id for each ad.  This is data that is contained in the map table.  After that, I need to replace all the existing ad_dimenion_ids in the fact table to the correct id from the map table.  The only way I know how to do that is to select the data and calculate the measures in the fact table based on the new grouping. 

All this data will be selected into a file and then loaded into a new fact table.

As for sorting, it is my understanding that it is better to have the data sorted before loading into a fact table. The typical query pattern against this fact is date, specialty, and geo based.  Therefore, I think that the sort is necessary.

BTW, I noticed a small error in the group and sort by clauses.  The ad_dimension_id in both clauses should be max_ad_dimension_id.  I made that change, and am re-running the query.  So far, no perceived performance increase from that change.

If there are filters that I could use that would help I would willing to try them.  I have used dummy filters in the past with some RDBMS systems and it improved performance, even though it made no sense.  i.e. “where date_dimension_id >= 1”.

[ Edited: 16 July 2010 01:02 PM by charlie]
Profile
 
Posted: 16 July 2010 01:16 PM   Ignore ]   [ # 8 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  738
Joined  2008-08-18

charlie,

Data migration - that’s what I thought.  smile

Here’s a point of clarification:  Data is normally created chronologically and loaded in the same order.  And that’s beneficial for high-performance date-oriented queries.  But it’s not necessarily the sorted order that is beneficial to the I/O reduction achieved with the Knowledge Grid, it’s the grouping of identical values into contiguous storage.  Obviously, this happens with an ORDER BY clause but it also happens with a GROUP BY clause.  (In some databases, a GROUP BY performs an implicit ORDER BY.  This is not the case with Infobright.)

For your scenario I recommend that you break the export into a few pieces that can leverage the Knowledge Grid by adding either an explicit date_dimension_id or a range of values and remove the ORDER BY clause.  This should speed up your exports and allow you to load in date ordered fashion.

Hope that helps.

Signature 
Profile
 
Posted: 20 July 2010 04:35 PM   Ignore ]   [ # 9 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-02-02

After several attempts to split the data into manageable pieces using date ranges, I had to give up using the brighthouse engine.  A query of approximately 165,000 rows based on a date range took 3.5 hours, 850,000 rows took almost 20 hours.  If I assume a linear progression of performance, it will take me 15 days to extract the data for my migration.

As a work around, I created MyISAM versions of all three tables involved in the query.  Total time to populate the tables from the brighthouse tables was about 15 minutes.  It then only took 15 minutes to extract the data into a CSV file using the query above.  The total time for my data migration was just over 31 minutes, including the creation of all MyISAM tables, populating them from brighthouse, extracting the data to CSV, and reloading the data back into new brighthouse tables.  Much improved over 15 days.

I am not saying that I think that brighthouse tables are inferior to the MyISAM equivalent but they do provide much better performance for doing large data extractions.  In “normal” use of the fact table, query times in brighthouse tables are 100-200 times faster for me than the equivalent MyISAM table.  That being said, I am really disappointed in the inability to join three tables together, summing and grouping, over a fairly small dataset.

Profile
 
Posted: 21 July 2010 04:21 AM   Ignore ]   [ # 10 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  916
Joined  2008-08-18

Hi !

When having any kind of performance problems with Infobright and asking for an explanation, it is advisable to post
- table schemes
- query (this has been presented)
- err.log corresponding to the query execution.

The last point is the key one, see http://www.infobright.org/wiki/Query_Execution_Log/ or just search for “Control Messages” . With the log, it maybe be possible to identify e.g. some trivial errors or lack of some optimization very needed in your case. Anyway, It would be very useful to know what causes your query to be slow…

Profile