I was recently corresponding with a customer who was looking to implement another project on Infobright. Our customer asked, paraphrasing, “Are you aware of any implementations where the designers have completely stopped using star schemas and slowly changing dimensions and instead store their data in a single Infobright table?” What is being referred to here is an entire data model that is denormalized, or flattened, down to a single table object for complete elimination of JOIN processing.
I responded:
In general, all data models should be designed to support the end user requirements. And in reality, many users’ requirements are framed by the tools they have chosen to use. In the case of many business intelligence (BI) tools, those requirements often appear to be dimensional in their reports and analyses. And that’s not unusual since that is how data is presented by the tool for user consumption. And for that reason, many BI environments are built on standard star schemas.
This is not to say that BI tools cannot adequately present data to users from a third normal form (3NF) data model, a star schema (denormalized dimensions), a snowflake (normalized) dimensional model or a single flat table model. But it is often more convenient for the BI tool to map facts and descriptive attributes to dimensional models. From a database perspective, the highest levels of performance can often be achieved when the database is asked to do the least amount of data manipulation beyond reading and filtering values. JOINs are a good example of this.
I have to tell you that Infobright has customers that have used all of the previously mentioned data models. In all cases, the choice of data model was primarily influenced by the end user requirements and secondarily by the overall management and maintenance complexity of alternative approaches.
* Where data is naturally created and collected in single flat files, they are often loaded into single “flat” data models and analytics are issued against them “as is”. This works because all of the values are static, are never UPDATEd, and there is little effort in loading them in that format. Think web logs, CDRs, and network switches.
* Where existing reports are migrated from application environments to segregate workloads, the data models often remain highly normalized and the reports are migrated with little to no modification. Think operational reporting systems.
* Where a choice of toolset imposes a data model - like MicroStrategy and a snowflake schema - this works, too. There is generally a high degree of effort to map source data to the model but after that it is highly flexible and allows complex analyses. Think high-end analytics and BI.
* Where the data is a collection of business “events” - movement of product, arrival of people or products, transactions, changes in status - that are not represented by a single log file and there are many types of events - with many more types of descriptors - must be captured and consolidated. That’s the easy part. The hard part is maintaining the descriptors as they change over time because they had one value for earlier events and another value for current or later events. This is your standard SCD. The traditional star schema lends itself very well to this administrative/maintenance task and still meets the BI requirements. Maintaining SCDs is almost impossible outside of some sort of dimensional model, star or snowflake. And it’s not really a data modeler’s choice. SCDs either exist in the systems you are modeling or they don’t. (Although choice of which type of SCD to maintain is possible, this is also often imposed by the users’ requirements.) If so, use a dimensional model.
I can’t say that I have observed anyone who has “stopped” using one data model or another. Nor have I seen migrations from one methodology to another.
Rather, the data models have simply fit the use case.
Hi David,
i was considering the same thing. Making one flat table instead of a star-scheme.
But i wondered : will filtering on a dimensiontable (e.g. product-brand) be faster on a small dimension table vs a large flat table?
Daan Blinde
let’s be honest: yes the datawarehouse is a central issue, but we all have a lot of tools with a lot of databases that are not part (or not yet part) of the datawarehouse.
and we suffer from the fact that the business buys tools which come as an all-in-one tool: frontend, rules, reporting, analysis all in one on the same database.
we continue to put more CPU and RAM inplace to cope with this because we have no choice in many cases: who has the money, the time and guts to touch that important production system?!
I think these type of tools are well known by all of us. a good way of approaching the issues around such an issue is to get the data out of the production system and into infobright.
this relieves the database (less data, smaller indexes) and relieves the CPU because reports and analyses can be done independent of the productive system.
now if you have a lot of such systems this should not take too long otherwise it costs too much and you never reach the end.
so a way to achieve this is to extract the data from the source, do a little bit of ETL maybe - e.g. add time related data such as year, quarter, week, etc - and then load it as one flat table into infobright.
this is fast, efficient and I think works for many standalone applications. adding the time attributes gives you the ability to have a history of data which production system often don’t have.
I believe flat tables in infobright are very sexy in a way that you spend less time designing a schema.
and yes SCD are an issue and it depends on where you work of how important it is to acurately track changes or history. but then again the standalone app, of which we have many, and which tends to grow and grow does not always need this.
I grew up in IT with designs and schemas and all this in mind, but lately I have very much learned again to admire the easiness of simple textfiles, tools such as awk, sed or grep and a simpel database such as mysql or infobright.
uwe geercken
Post Comment