I would like to start a discussion with you on how you practically design a datawarehouse with infobright.
traditional star schema design is made up of dimensions and fact tables. diemsnions are the entry points into the fact tables and represents the basic views such as e.c. customer, location, product, date, time, etc. the fact table caries a reference to the dimension in form of a key.
in infobright theoretically - because of of the column orientation and the compression - you would not need dimension tables. instead of seperating the values in a dimension, one could integrate the data in each row. but then on the other hand, one would likely keep some dimensions apart from the fact tables such as an audit dimension.
could somebody elaborate on the pro’s and con’s of a design without dimensions? just to notice: we don’t have very large amounts of data so integrating additional dimension data would probably not be an issue.
thanks for your feedback.
uwe geercken
IT manager swissport international ltd., zurich
Please take a look into a discussion in the same thread: Does denormalization help or hurt query performance?
Using one denormalized fact table would be usually faster and in many situations it is worth considering. On the other hand, we may prefer dimension tables if there is more columns inside them - e.g. a customer with name, home address, birthdate etc. It is too space-consuming (and too error-prone) to repeat all such information for every fact regarding the customer.
For sure it is not worth to have a dimension table with one-two numerical columns (apart of the key), or even a char column (we may sometimes use ‘lookup’ modifier instead).
thank you for your response. I have seen the other thread.
for example the audit dimension, that I plan to create, I think I want to keep seperate from the rest. it is for the IT housekeeping, and the business user won’t be interested in that data/columns.
for other dimensions, I think I might integrate it into the fact tables: the ETL process should make sure that the data is consistant if reused for multiple (fact) tables.
one other question I have: when you are talking about lookups: what is the difference to a normal dimension table?
Lookup: it is a modifier of char/varchar column definition. It makes the column to use internal dictionary instead of storing explicit texts. The advantage is that we’re dealing internally with numbers (relatively small), which is faster. The limitation is that a number of distinct values in such a column we should not exceed several thousands (hard limit is 32k, but performance may degrade sooner). Typical examples of usage for such columns are names of countries, states, names of brands/models (if limited) etc.
If we declare a text column in fact table as ‘lookup’, we will not loose storage (which is one of reasons to use dimension tables).
In general, the best solution is to check out different schemas on a reasonably large test implementation.
This is an interesting discussion, and I will weigh in with my thoughts (purely from a design and usability perspective).
The star schema was really put forward as a way to support data warehousing in a relational database. Not that it doesn’t work - one of it’s core benefits is that it is relatively easy to understand. After all, a driving force in the delivery of DW was that your typical business layperson, using rudimentary query tools, couldn’t or didn’t have time to figure out a complex transactional model. Heck, even most IT guys had a hard time. And that drove long lead times to get any information out of an OLTP system. So the star schema approach was developed and it made life a lot easier.
That’s not to say that the star schema approach is perfect. I always get hung up on slowly changing dimensions. Using star schemas, we ended up with Type 2 and Type 3 and so on dimension tables. And guess what? Complexity starts to creep back in.
Uwe, your suggestion is one that I’ve seen gaining traction. Why not fully denormalize the model so that everything exists in a single table. There are some great benefits to this approach: first off, it’s very simple and easily understood since everything is there in one table—I don’t have to join to anything! And managing slowly changing dimensions is a non-issue, since I don’t have them.
Now, this approach is not without its drawbacks: in a traditional RDBMS solution, your rows get ‘huge’ and that has an impact on performance (not an issue with a column based DBMS, since you only retrieve the columns you care about). Storage requirements can be massive (again, not really an issue with Infobright due to our compression technology) since you end up with a lot of repetitive data.
One of the biggest challenges with a fully denormalized model is the ETL process - while you remove the complexity from the data model, you shift it over to the ETL process - you now have to manage multiple tables and ensure that all are consistent prior to writing out that one large row to the database!
Uwe, not sure if this helps. I am very interested in hearing your experiences if you choose to go down this path.
could somebody elaborate on the pro’s and con’s of a design without dimensions?
Hi,
Pros:
- Faster queries (no joins)
- Simpler database schema, simpler queries
- In some specific cases: smaller database size (no need for foreign keys)
Cons:
- Higher risk of inconsistent data. Every insert/update/delete must be aware that copies of a given value may (or must) exist in other rows of the table and all of them must be updated accordingly ...
- ... which also means slower insert/update/delete if many rows must be updated
- No simple way to store a ‘dimension’ row without any ‘fact’, e.g. keep info on a customer who is not involved in any ‘fact’ yet
- Usually bigger database size
Why not fully denormalize the model so that everything exists in a single table. ... And managing slowly changing dimensions is a non-issue, since I don’t have them.
John,
I agree with you in general, and specifically with Infobright, as one of the original design goals of the star schema, as you noted, was to assist analytical queries in a traditional row-oriented RDBMS. This doesn’t necessarily apply to column-oriented RDBMS and especially not to Infobright.
However, I would offer a slightly different opinion on slowly changing dimensions. SCDs don’t go away in a denormalized data model - only the dimension tables do. And one still needs a dimension table (or similar) in the staging database to build the denormalized fact records in the ETL process, again, as you noted.
Dimensional attributes still change.
If it’s just the value of an existing attribute that changes, that’s not so hard - update the staging table and build the new facts with the updated dimension values. But adding or deleting dimensional attributes can be exponentially harder to accomplish if they exist in a massive fact table. Perhaps even impossible for all practical purposes. This is another advantage of having the dimensional attributes segregated into individual “outrigger” tables.
And administrative updates and maintenance - spelling corrections, incorrect values, ETL bugs, etc. - can be prohibitive to perform if they must be applied to large fact tables.
In the end, though, the benefits might outweigh the costs. It’s a “big picture” decision that includes administrative, operational and performance consideration.
as most users will come from a traditional background of using a star schema, I believe this discussion is VERY important and I am missing it in the upfront information for the tool.
one of the first questions of most people will be: how do I design my tables? what happens with my dimensions and fact tables? of course, there is not a general way of doing it - but at least one should point it out, that denormalisation is a valid way one can go.
as we are in the middle of building our dwh, I am not completely sure if I have an overview of all requirements of the business. I see a point though with slowly changing dimensions. as indicated - for housekeeping purposes - I might want to set up an audit table/dimension to identify which record belongs to which load.
I have to play around a little bit and make some tests in the next days. and will keep you posted.
by the way - I have written a datagenerator, which creates mass ascii data based on word lists, random data or reg expressions. I am using this to generate mass data for testing the database. if anyone is interested, look at http://datamelt.com
as most users will come from a traditional background of using a star schema,
...
by the way - I have written a datagenerator, which creates mass ascii data based on word lists, random data or reg expressions. I am using this to generate mass data for testing the database. if anyone is interested, look at http://datamelt.com
rgds,
uwe
I am interested in a data generator that will generate ascii files without needing to connect to a database. I will definitely keep your link in my mind.
as most users will come from a traditional background of using a star schema, I
.
.
.
by the way - I have written a datagenerator, which creates mass ascii data based on word lists, random data or reg expressions. I am using this to generate mass data for testing the database. if anyone is interested, look at http://datamelt.com
rgds,
uwe
Uwe, I’m trying out your data generator and it looks good for generating numbers and characters. I’d like to know how to generate random dates or times within a specified range. For example, Inforbright dates can be from 100‐01‐01 to 9999‐12‐31, inclusive. How can I generate dates in that range with your generator?
you can use a regular expression to generate the date and/or time.
please look at the regular expression example in the tool or check on the internet if you are not so familiar with writing the regular expressions yourself.
I just figured that generating dates from regular expressions gets quite complicated.
so I will release a new version of the datagenerator tool the next days, which will include a new type besides the existing ones that allows the generation of date/time fields.
I have updated the datagenerator tool for generating mass data. additionally to the already existing abilities to generate random data, data based on word groups and data based on regular expression patterns, you can now specify a datetime format to generate easily date and/or time data. all can be output to the console or to a file in CSV or fixed length ascii.
you will need java to run it. and then you can generate mass data with millions of rows and see how infobright performs
go to my website at http://www.datamelt.com and download it from there. Under GPL and source is included.
Star Schema design is the simplest possible analytical schema. There are several reasons “Star Schema” is heavily promoted.
* It is the only scalable schema in case of “shared-nothing” massively parallel processing (MPP) architecture. In this case, the fact table is partitioned on a particulate field; different data partitions are stored on different servers; all dimension tables are cloned and cached at each server node.
* It is the only analytical schema type feasible in case of database deployment in the cloud (where “cloud” means Amazon cloud or any other cloud computing offering).
If your schema includes two or more fact tables, “shared-nothing” MPP approach is inherently non-scalable, because any time you have a complex join query involving both fact tables, which is not equijoin on the partitioning keys, you need to exchange data between partitions in order to evaluate join conditions during query execution. Thus, you will have n*(n-1)/2 intra-query data exchanges.
* In case of 50 nodes, 1225 data exchange links; in case of 200 nodes, 19900 data exchange links; in case of 1000 nodes, 499500 data exchange links.
* As all data exchanges need to be asynchronous, your MPP database server engine will need to employ tuple queues at each of the data exchanges. Thus, if it allocates, say, 512K for each incoming and outgoing tuple queue, then these tuple queues will require over 1GB of memory at each server node in case of 50 nodes; almost 20GB of memory at each server in case of 200 nodes; and almost 500GB of RAM at each server node in case of 1000 nodes.
So what do you do in case you have a more sophisticated data model and business intelligence and analytical query processing needs that cannot be represented with the vanilla “star schema”? You start using Infobright.