I have a request_date column in my fact table having values like ‘2009-11-28’.
I have tried the following ways to define my date dimension.
1. I am not using any separate date dimension table. I am directly using request_date column in fact table. It is very slow as i need to use date_format(request_date) function or year(request_date) or month(request_date) functions which are very slow.
2. I tried explicit year, month, day column in my fact tables which works fine for me. All these explicit columns are smallint, tinyint and tinyint type respectively. But at some places i need date like ‘Jan-2010’ etc which require date_format function which increases the response time almost 2 times.
3. Thirdly i tried a separate date dimension table with all the information pre calculated. Its response time is also similar to 2nd step’s 2nd scenario which increases the time 2 times.
Please suggest me what is the best way to define a date dimension?
Thanks for your valuable suggestion. Actually i have already gone through this document and i have already tried degenerated dimensions, separate dimension table and inline table but still the performance is not acceptable and it is in acceptable range if i use degenerated year, month, day column in fact details table. Let me explain it more clearly.
1. If i use fact_table with request_date column of type date.
for e.g.
This design does not have separate dimension table for date. In my queries i need year(request_date), month(request_date), day(request_date) which are slow and performance is not acceptable. Number of records in fact table is almost 300 million.
2. Then i moved to a separate dimension table and fact table.
for e.g.
Now it is giving acceptable performance. But at some places i need “Jan-09” or “01-Jan-09” etc. for which i need to use date_format function on request_date column which is again taking similar time as step 2 was taking which is not acceptable.
Can somebody help me how should i design my date dimension to get the acceptable performance with 300 million data in fact table?
You can calculate “Jan-09” or “01-Jan-09” using some sort of ETL process, and store it as a separate column in your fact table. Otherwise, you would have to maintain a separate date dimension table and incur the cost of a JOIN. These are the only alternatives, really.
But month and day varchar column are slower than tinyint columns. And its permormance is also similar to step 2. It means just changing the type of column reduces the performance. Do you have any idea?
Please investigate some ETL tools to add the additional data as denormalized columns. A suggestion of the ones we work well with are Pentaho Data Integration (PDI) (also known as Kettle), and Talend. Create a job to calculate the required date formats when you load the data.
The issues you are experiencing are confusing two points - and neither has to do with data loading. Those are data modeling and querying an Infobright dimensional model.
First, modeling. For over 15 years I have advocated creating the most robust date dimension table possible. Have as many columns of as many variations of the date and its derivatives in as many data types as necessary/possible. Your guide should be the queries themselves and what they need. Don’t force any functions or translations onto your tool or queries - perform the function or translation once and save it in the date dimension. That means if one tool or user needs ‘month’ as an integer, save it in an integer column/data type. If another needs the numeric value for ‘month’ represented as a text value, save it again in a CHAR(2) field. It doesn’t matter - in a traditional database or Infobright. There are only 366 days in a year, maximum. For a date dimension with even 100 years’ worth of data, that’s less than 37,000 records. And in Infobright, that’s less than one Data Pack. With compression, it’s nothing in total disk space.
Second, query performance. The issue is getting acceptable performance when querying the database, whether with a degenerate date dimension, which would be very hard to do with my advice above, or by joining to a separate robust date dimension, which you have also tried. But I suspect the issue is not the join itself. (It’s hard to say because the exact query is not provided.) I suspect the reason is because when joining to the separate dimension table, you are also filtering on the dimensional attributes directly and not on the fact table. The query will always have to open the single date dimension Data Pack(s); that’s a given. That won’t change no matter what you do. And it doesn’t affect performance. What can be affected is if you move at least one filter to the large fact table, possibly in addition to any other filters on the dimension table. The optimizer will not do this for you automagically, under the covers. One must apply a specific filter against the large table to allow the optimizer to create an efficient rough set approximation of the number of Data Packs to read from the large table or else it will read them all.
So you can return whatever value, string or concatenation you need form the date dimension table. But make sure that your filter is against whatever date value you use in the fact table to join to it (please make it a DATE or INT data type, though). This will improve performance.
Please give it a try and let us know how it turns out.
Following two queries are taking more than 2.5 minutes. Here i am using separate date dimension table. Total data in fact table is 236 million and date dimension have only 2000 records. None of the dimension table is too heavy. Should these queries take this much of time (2.5 minutes) with this volume of data (236 million in fact table)? Can you help me figure out the bottle neck in these queries?
Query1
SELECT `pt`.`id` AS `c0`, `dim_subscriptions`.`value` AS `c1`, `acc_g`.`type` AS `c2`, `dp`.`year_name` AS `c3`, `dp`.`month` AS `c4`, SUM(`fd`.`usage_count`) AS `m0` FROM `npg_page_type_view` AS `pt`, `fact_details` AS `fd`, `dim_subscriptions` AS `dim_subscriptions`, `npg_account_parent_view` AS `acc_g`, `dim_period` AS `dp` WHERE `fd`.`page_type` = `pt`.`id` AND `pt`.`id` IN (‘19’, ‘20’, ‘1506’) AND `fd`.`subscription` = `dim_subscriptions`.`value` AND `dim_subscriptions`.`value` = ‘subscribed’ AND `fd`.`account_id` = `acc_g`.`child_id` AND `acc_g`.`type` = ‘Group’ AND `fd`.`request_date` = `dp`.`date_id` AND `dp`.`year_name` = 2010 AND `dp`.`month` = 1 GROUP BY c0, c1, c2, c3, c4;
Query2
SELECT `pt`.`id` AS `c0`, `dim_subscriptions`.`value` AS `c1`, `acc_g`.`type` AS `c2`, `acc_g`.`parent_id` AS `c3`, `dp`.`year_name` AS `c4`, `dp`.`month` AS `c5`, SUM(`fd`.`usage_count`) AS `m0` FROM `npg_page_type_view` AS `pt`, `fact_details` AS `fd`, `dim_subscriptions` AS `dim_subscriptions`, `npg_account_parent_view` AS `acc_g`, `dim_period` AS `dp` WHERE `fd`.`page_type` = `pt`.`id` AND `pt`.`id` IN (‘19’, ‘20’, ‘1506’) AND `fd`.`subscription` = `dim_subscriptions`.`value` AND `dim_subscriptions`.`value` = ‘subscribed’ AND `fd`.`account_id` = `acc_g`.`child_id` AND `acc_g`.`type` = ‘Group’ AND `acc_g`.`parent_id` IN (‘ABM’, ‘ACA’, ‘AHCL’, ‘AISTI’, ‘AMIGOS’, ‘ANKOS’, ‘BCR’, ‘BDINASP’, ‘BIBSAM’, ‘BICFB’, ‘BMVEL’, ‘BOINASP’, ‘BON’, ‘BRIN’, ‘BUGALICIA’, ‘CALIS’, ‘CAPCON’, ‘CAPES’, ‘CARLI’, ‘CARNet’, ‘CAS’, ‘CAUL’, ‘CBUC’, ‘CCHC’, ‘CDL’, ‘CFL’, ‘CGINASP’, ‘CHEST’, ‘CHML’, ‘CIBER’, ‘CIC’, ‘CILEA’, ‘CIPE’, ‘CNR’, ‘CONCERT’, ‘CONICYT’, ‘COPPUL’, ‘COUPERIN’, ‘CREPUQ’, ‘CSIR’, ‘CUNY’, ‘DEF’, ‘DELCON’, ‘DFG’, ‘DST’, ‘ECINASP’, ‘EEBSCOArch’, ‘EIRA’, ‘ELNET’, ‘EPSCOR’, ‘ETINASP’, ‘FGT’, ‘FINELIB’, ‘FLIN’, ‘FMHS’, ‘FORSA’, ‘GASCA’, ‘GASCG’, ‘GASCO’, ‘GASCS’, ‘GHINASP’, ‘HELIN’, ‘HGF’, ‘HHS’, ‘HNINASP’, ‘HUN’, ‘ICELAND’, ‘ICMR’, ‘IELC’, ‘IMLC’, ‘INASP’, ‘INDEST’, ‘INFLIBNET’, ‘INISCI’, ‘IRI’, ‘JMLA’, ‘JNUL’, ‘JPLA’, ‘JRC’, ‘KEINASP’, ‘KEO’, ‘KESLI’, ‘LC’, ‘LRLC’, ‘MALC’, ‘MALMAD’, ‘MEINBRE’, ‘MGINASP’, ‘MINITEX’, ‘MIR’, ‘MIRACL’, ‘MNINASP’, ‘MOBIUS’, ‘MSKCC’, ‘MWINASP’, ‘MZINASP’, ‘NDS’, ‘NEICON’, ‘NERL’, ‘NESLI’, ‘NIINASP’, ‘NJKI’, ‘NPINASP’, ‘NTSL-A’, ‘OCUL’, ‘OHIONET’, ‘PALINET’, ‘PASCAL’, ‘PC’, ‘PHECP’, ‘PPV’, ‘PUL’, ‘REMBA’, ‘RESCOLINC’, ‘RWINASP’, ‘SASLI’, ‘SCAMEL’, ‘SCELEC’, ‘SES’, ‘SSCM’, ‘SUL’, ‘SVS’, ‘SWE’, ‘TRI-COLLEGE’, ‘TRLN’, ‘TZINASP’, ‘UCC’, ‘UGINASP’, ‘UNA’, ‘USDOE’, ‘UTD’, ‘UTSC’, ‘VALE’, ‘VDIC’, ‘VIVA’, ‘VNINASP’, ‘ZMINASP’, ‘ZWINASP’) AND `fd`.`request_date` = `dp`.`date_id` AND `dp`.`year_name` = 2010 AND `dp`.`month` = 1 GROUP BY c0, c1, c2, c3, c4, c5;
Following two queries are taking more than 2.5 minutes. Here i am using separate date dimension table. Total data in fact table is 236 million and date dimension have only 2000 records. None of the dimension table is too heavy. Should these queries take this much of time (2.5 minutes) with this volume of data (236 million in fact table)? Can you help me figure out the bottle neck in these queries?
arunsinghal01,
Starting with the first query, I have reformatted it here for clarity:
SELECT `pt`.`id` AS `c0`, `dim_subscriptions`.`value` AS `c1`, `acc_g`.`type` AS `c2`, `dp`.`year_name` AS `c3`, `dp`.`month` AS `c4`, SUM(`fd`.`usage_count`) AS `m0` FROM `npg_page_type_view` AS `pt`, `fact_details` AS `fd`, `dim_subscriptions` AS `dim_subscriptions`, `npg_account_parent_view` AS `acc_g`, `dim_period` AS `dp` WHERE `fd`.`page_type` = `pt`.`id` AND `pt`.`id` IN (‘19’, ‘20’, ‘1506’) AND `fd`.`subscription` = `dim_subscriptions`.`value` AND `dim_subscriptions`.`value` = ‘subscribed’ AND `fd`.`account_id` = `acc_g`.`child_id` AND `acc_g`.`type` = ‘Group’ AND `fd`.`request_date` = `dp`.`date_id` AND `dp`.`year_name` = 2010 AND `dp`.`month` = 1 GROUP BY c0, c1, c2, c3, c4;
There are many things to consider in this first query. The first is that there is no filter on the fact_details table so I suspect that all Data Packs for the ‘usage_count’ field are being read and decompressed. Only by supplying the relevant lines from the log file, bh.err, (after setting ControlMessages=2 in brighthouse.ini and restarting the Infobright service), will tell for sure. This point was made in my previous reply:
What can be affected is if you move at least one filter to the large fact table, possibly in addition to any other filters on the dimension table.
But there is more:
- 5 table join - this is more than simply joining the fact table to the date dimension
- 5 column GROUP BY - this forces a 5-way sort operation and is quite possibly where the bulk of this operation’s time is spent
- 10 character text matching on dim_subscriptions - this may or may not be an issue depending on the dimension’s size
- 5 character text matching on acc_g - this may or may not be an issue depending on the table’s size
Your best chances for performance improvement lie in moving or duplicating one or more of the most restrictive dimension attributes into the fact_details table and applying a filter to that column - perhaps the year name and month (both appear to be numeric and are likely also contiguous).
Following two queries are taking more than 2.5 minutes. Here i am using separate date dimension table. Total data in fact table is 236 million and date dimension have only 2000 records. None of the dimension table is too heavy. Should these queries take this much of time (2.5 minutes) with this volume of data (236 million in fact table)? Can you help me figure out the bottle neck in these queries?
arunsinghal01,
Same comments and suggestions for the second query as the first. Plus the existence of 138 character strings (byte-by-byte character matching) in an IN clause for ‘acc_g.parent_id’ in the WHERE clause.
SELECT `pt`.`id` AS `c0`, `dim_subscriptions`.`value` AS `c1`, `acc_g`.`type` AS `c2`, `acc_g`.`parent_id` AS `c3`, `dp`.`year_name` AS `c4`, `dp`.`month` AS `c5`, SUM(`fd`.`usage_count`) AS `m0` FROM `npg_page_type_view` AS `pt`, `fact_details` AS `fd`, `dim_subscriptions` AS `dim_subscriptions`, `npg_account_parent_view` AS `acc_g`, `dim_period` AS `dp` WHERE `fd`.`page_type` = `pt`.`id` AND `pt`.`id` IN (‘19’, ‘20’, ‘1506’) AND `fd`.`subscription` = `dim_subscriptions`.`value` AND `dim_subscriptions`.`value` = ‘subscribed’ AND `fd`.`account_id` = `acc_g`.`child_id` AND `acc_g`.`type` = ‘Group’ AND `acc_g`.`parent_id` IN (‘ABM’, ‘ACA’, ‘AHCL’, ‘AISTI’, ‘AMIGOS’, ‘ANKOS’, ‘BCR’, ‘BDINASP’, ‘BIBSAM’, ‘BICFB’, ‘BMVEL’, ‘BOINASP’, ‘BON’, ‘BRIN’, ‘BUGALICIA’, ‘CALIS’, ‘CAPCON’, ‘CAPES’, ‘CARLI’, ‘CARNet’, ‘CAS’, ‘CAUL’, ‘CBUC’, ‘CCHC’, ‘CDL’, ‘CFL’, ‘CGINASP’, ‘CHEST’, ‘CHML’, ‘CIBER’, ‘CIC’, ‘CILEA’, ‘CIPE’, ‘CNR’, ‘CONCERT’, ‘CONICYT’, ‘COPPUL’, ‘COUPERIN’, ‘CREPUQ’, ‘CSIR’, ‘CUNY’, ‘DEF’, ‘DELCON’, ‘DFG’, ‘DST’, ‘ECINASP’, ‘EEBSCOArch’, ‘EIRA’, ‘ELNET’, ‘EPSCOR’, ‘ETINASP’, ‘FGT’, ‘FINELIB’, ‘FLIN’, ‘FMHS’, ‘FORSA’, ‘GASCA’, ‘GASCG’, ‘GASCO’, ‘GASCS’, ‘GHINASP’, ‘HELIN’, ‘HGF’, ‘HHS’, ‘HNINASP’, ‘HUN’, ‘ICELAND’, ‘ICMR’, ‘IELC’, ‘IMLC’, ‘INASP’, ‘INDEST’, ‘INFLIBNET’, ‘INISCI’, ‘IRI’, ‘JMLA’, ‘JNUL’, ‘JPLA’, ‘JRC’, ‘KEINASP’, ‘KEO’, ‘KESLI’, ‘LC’, ‘LRLC’, ‘MALC’, ‘MALMAD’, ‘MEINBRE’, ‘MGINASP’, ‘MINITEX’, ‘MIR’, ‘MIRACL’, ‘MNINASP’, ‘MOBIUS’, ‘MSKCC’, ‘MWINASP’, ‘MZINASP’, ‘NDS’, ‘NEICON’, ‘NERL’, ‘NESLI’, ‘NIINASP’, ‘NJKI’, ‘NPINASP’, ‘NTSL-A’, ‘OCUL’, ‘OHIONET’, ‘PALINET’, ‘PASCAL’, ‘PC’, ‘PHECP’, ‘PPV’, ‘PUL’, ‘REMBA’, ‘RESCOLINC’, ‘RWINASP’, ‘SASLI’, ‘SCAMEL’, ‘SCELEC’, ‘SES’, ‘SSCM’, ‘SUL’, ‘SVS’, ‘SWE’, ‘TRI-COLLEGE’, ‘TRLN’, ‘TZINASP’, ‘UCC’, ‘UGINASP’, ‘UNA’, ‘USDOE’, ‘UTD’, ‘UTSC’, ‘VALE’, ‘VDIC’, ‘VIVA’, ‘VNINASP’, ‘ZMINASP’, ‘ZWINASP’) AND `fd`.`request_date` = `dp`.`date_id` AND `dp`.`year_name` = 2010 AND `dp`.`month` = 1 GROUP BY c0, c1, c2, c3, c4, c5;