Joinutility seperatorLogin utility separator Infobright.com
   
 
Slowly changing dimensions
Posted: 16 March 2010 01:57 PM   Ignore ]  
Newbie
Rank
Total Posts:  6
Joined  2010-03-10

How does one manage type II slowly changing dimensions in ICE? The old dimensional row needs to be closed off (in my case, end-dated), and the new row inserted. In addition, we need to link the facts to the dimension surrogate keys, and I can’t see how this can be done without traditional DML.

On a tangent, I have read that dimensional modelling is less important with ICE, since it is an abstraction that isn’t required in a performant, column-based db. Regardless, in our case we still need to be able to have version of a relational equivalent of a dimension.

Profile
 
Posted: 16 March 2010 02:09 PM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  505
Joined  2008-08-18

Hi,

You generate an image of the entire dimension, the way you want it to look after the update, into a CSV file. Then drop and reload the table.

This is usually not so bad because dimensions tend to not be as large as your main fact table. Still, they can get quite large.

Example:
1) Select all the rows that have not changed (based on some criteria, JOIN or ETL stage) and all closed products

SELECT p.product_idp.product_namep.start_dtp.end_dt FROM product_dim p
JOIN current_product_list cp on p
.product_id cp.product_id where end_dt IS NULL
UNION ALL SELECT p
.product_idp.product_namep.start_dtp.end_dt FROM product_dim p
 WHERE end_dt IS NOT NULL
INTO OUTFILE 
'/tmp/unchanged.csv' fields terminated by '\t' enclosed by '"'

2) Select all the rows that need to be closed out (based on not being found in current_products)

SELECT p.product_idp.product_namep.start_dtcurrent_timestamp FROM product_dim p
LEFT OUTER JOIN current_product_list cp on p
.product_id cp.product_id WHERE cp.product_id IS NULL
INTO OUTFILE 
'/tmp/dropped.csv' fields terminated by '\t' enclosed by '"'

3) Recreate product_dim (empty)

CREATE TABLE product_dim_tmp LIKE product_dim;
DROP TABLE product_dim;
CREATE TABLE product_dim LIKE product_dim_tmp;
DROP TABLE product_dim_tmp

4)

LOAD DATA INFILE '/tmp/dropped.csv' INTO product_dim
 fields terminated by 
'\t' enclosed by '"'

LOAD DATA INFILE '/tmp/unchanged.csv' INTO product_dim
 fields terminated by 
'\t' enclosed by '"'

5) Load any new products that weren’t previously in
Hope this helps.

Geoffrey

Signature 
Profile
 
Posted: 16 March 2010 02:42 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-03-10

Thanks for your fast reply, Geoffrey. I kind of figured something similar would be required, but I’m hoping to confirm a set of best practices from members of this forum.

So assuming that product id is a surrogate key, I’ll need a way to link the fact table to the dimension. i.e. - the incoming fact data uses the natural key product name, but we’ll need to use product id. Is preprocessing the incoming file and adding the surrogate key prior to loading the recommended approach?

Dan

Profile
 
Posted: 16 March 2010 02:59 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  505
Joined  2008-08-18

Yes, normally you would load the new stuff into a staging table, join back on the natural key to generate the surrogate keys, and then load the fact table from that.

One challenge is how to generate the surrogate keys, since ICE doesn’t (yet) support the AUTO_INCREMENT attribute. Most of our customers have been using an ETL tool for that, such as Kettle or Talend.

Geoffrey

Signature 
Profile
 
Posted: 16 March 2010 03:48 PM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  6
Joined  2010-03-10

Hi Geoffrey.

This is my normal approach in a traditional rdbms environment. But aren’t we in a catch 22 with ICE? Once we’re in the staging table, we can’t add the surrogate keys since there is no way to update the data once we’ve loaded it (as there is no DML support). I’m guessing that the approach would need to be something like

1. Bulk load the data into a staging table
2. Dump a join back to the dimension that includes the id - what’s the best approach for this?
3. Bulk load the dumped data into the fact table.

My earlier suggestion of pre-processing the original file, would be something like

1. Outside of the DB, load the dimension into a memory structure, say a hash or associative array, indexed by the natural key
2. Read though the file, lookup the surrogate key, generating a new copy of the file that includes the surrogate key
3. Bulk load the data into the fact table.

I’m not sure how scalable this approach would be, however. It would assume that the dimensions are small, and that reading and rewriting the fact file would be a reasonably quick / low-resource.

Dan

Also, I’m making an assumption that the ETL is hand-coded, and i not using an ETL tool

Dn

Profile
 
Posted: 16 March 2010 04:14 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  505
Joined  2008-08-18

An alternative is to generate the surrogate keys using AUTO_INCREMENT on a MyISAM table, then copy the data over to an Infobright (Brighthouse) table.

Geoffrey

Signature 
Profile