I guess it's fair to say that I'm outsourcing this blog post to Roland Bouman and Jos van Dongen who have just published a great new open source software (OSS) business intelligence (BI) book called "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL".
I ordered my copy from Amazon.com; I'd bet it's also available at your favorite bookstore by now, too. While the OSS BI market has multiple participants - and Infobright has integrated and partnered with most, if not all, of them - this particular book happens to serve as documentation for Pentaho but it can aslo be considered an excellent reference on the topics of BI application development and data warehousing in general.
It's clear when an author, or authors, speak from experience. This paragraph illustrated that to me:
"Over the past decade, open source variants of more and more types of software have become commonly accepted and respected alternatives to their more costly and less flexible proprietary counterparts. The fact that software is open source is often mistaken for being free of cost, and though that might be true if you only look at the license costs, a BI solution cannot (and never will) be free of cost. There are costs associated with hardware, implementation, maintenance, training and migration, and if this is all summed up it turns out that licenses make up only a small portion of the total lifecycle cost of any software solution. Open source, however is much more than a cheaper way of acquiring software. The fact that the source code is freely available to anyone ensures better code quality since it is more likely that bugs are found when more people have access to the source than just the core developers. The fact that open source software is built on open standards using standard programming languages (mostly Java) makes it extremely flexible and extensible. And the fact that most open source software is not tied to a particular operating system extends this flexibility and freedom even further."
Thanks Roland and Jos!
Comments (2)
I was recently corresponding with a customer who was looking to implement another project on Infobright. Our customer asked, paraphrasing, "Are you aware of any implementations where the designers have completely stopped using star schemas and slowly changing dimensions and instead store their data in a single Infobright table?" What is being referred to here is an entire data model that is denormalized, or flattened, down to a single table object for complete elimination of JOIN processing.
I responded:
In general, all data models should be designed to support the end user requirements. And in reality, many users' requirements are framed by the tools they have chosen to use. In the case of many business intelligence (BI) tools, those requirements often appear to be dimensional in their reports and analyses. And that's not unusual since that is how data is presented by the tool for user consumption. And for that reason, many BI environments are built on standard star schemas.
This is not to say that BI tools cannot adequately present data to users from a third normal form (3NF) data model, a star schema (denormalized dimensions), a snowflake (normalized) dimensional model or a single flat table model. But it is often more convenient for the BI tool to map facts and descriptive attributes to dimensional models. From a database perspective, the highest levels of performance can often be achieved when the database is asked to do the least amount of data manipulation beyond reading and filtering values. JOINs are a good example of this.
I have to tell you that Infobright has customers that have used all of the previously mentioned data models. In all cases, the choice of data model was primarily influenced by the end user requirements and secondarily by the overall management and maintenance complexity of alternative approaches.
* Where data is naturally created and collected in single flat files, they are often loaded into single "flat" data models and analytics are issued against them "as is". This works because all of the values are static, are never UPDATEd, and there is little effort in loading them in that format. Think web logs, CDRs, and network switches.
* Where existing reports are migrated from application environments to segregate workloads, the data models often remain highly normalized and the reports are migrated with little to no modification. Think operational reporting systems.
* Where a choice of toolset imposes a data model - like MicroStrategy and a snowflake schema - this works, too. There is generally a high degree of effort to map source data to the model but after that it is highly flexible and allows complex analyses. Think high-end analytics and BI.
* Where the data is a collection of business "events" - movement of product, arrival of people or products, transactions, changes in status - that are not represented by a single log file and there are many types of events - with many more types of descriptors - must be captured and consolidated. That's the easy part. The hard part is maintaining the descriptors as they change over time because they had one value for earlier events and another value for current or later events. This is your standard SCD. The traditional star schema lends itself very well to this administrative/maintenance task and still meets the BI requirements. Maintaining SCDs is almost impossible outside of some sort of dimensional model, star or snowflake. And it's not really a data modeler's choice. SCDs either exist in the systems you are modeling or they don't. (Although choice of which type of SCD to maintain is possible, this is also often imposed by the users' requirements.) If so, use a dimensional model.
I can't say that I have observed anyone who has "stopped" using one data model or another. Nor have I seen migrations from one methodology to another.
Rather, the data models have simply fit the use case.