Users of Business Intelligence (BI) tools and analytical databases, such as data warehouses and data marts, are not interested in reviewing, trending, graphing or manipulating large amounts of consistent data - whether that consistency is represented by a repetitive list of a constant value or a consistently random set of data.
In the case of constancy, there is little value to be gained from a BI tool over a simple, naïve query, and little business or process insight to be gained from what everyone already knows anyway (“it’s all the same”). Likewise, a set of consistently random data is, by definition, just “noise” and provides even less insight.
Business Intelligence users and analysts are looking for trends, patterns and anomalies in their data.
• North American grocery stores do sell more eggnog in December.
• Some retail sales promotions do (and some don't) drive increased store traffic.
• Home heating oil consumption does increase during winter in colder climates.
• More television viewers do watch sports specials than the daily newscasts.
• Marketing campaigns can drive increased traffic to a particular web site.
These are the patterns in real business, system and process data that are of interest to real users and therefore the environment upon which BI software should be benchmarked. This is in stark contrast to other benchmarks that purposely use randomized and/or randomly generated data that is more fitting for a physical test of hardware systems into which the data is loaded than the BI tools that will be used.
We all know that data doesn’t get created, captured and stored in smooth, consistently random fashion, so the BIPI doesn’t propose a test for those tools where the first and most fundamental choice for the test environment – the data – is not aligned with users’ realities. And as the name suggests, this is a performance-oriented index, not strictly, or even mostly, a functionally-oriented set of tests but must represent real-word BI functional use cases.
Simply put, the data matters.
Stay tuned for The BIPI - Part II, The Data
Comments (0)
There was a recent blog posting that pointed out that ICE and IEE for that matter, are not ACID compliant. Which got me thinking. My first thought was about my failed attempt to follow my father's footsteps and become a chemist. After some initial attempts at experiments, I realized that I was more like my mother -- couldn't measure anything if my life depended on it. Making me a cook not a chemist. But I digress...
My second thought was, what is really required for data warehousing?
Certainly, on the various data warehousing implementation I've worked on, no one cared about these features and in fact referential integrity features were "turned off". Two reasons, as I recall, first the data was not very clean and second it slowed loading down considerably.
On the first note about dirty data, this is a common problem in data warehousing. One hopes that their source systems ensure the data is clean but this is rarely the case (with or without ACID). Fact table records that don't match dimension tables is a common problem and leave the user screaming for LEFT OUTER JOIN.
On loading, in most cases, the batch window is extremely strained with little time to complete the daily work load. Since logging and referential integrity checks reduce the efficiency of load; most people turn it off.
Let's look at ACID a bit more closely.
Atomicity -- this refers to the all or nothingness of a transaction. In an OLTP system, a group of changes over several tables needs to be either all successfully applied or all not. In data warehousing, generally you are loadinga bunch of records in one table. Basically, you want to ensure that you know the state of what was loaded if there is a failure but there are many scenarios where you would want it to partially load. So you know where to restart the load and ensure what ever was loaded and committed is correct. This is a slightly different use case. Single table at a time. Task is to get all the data loaded and if it happens in batches as long as you know the state at any moment everything is cool.
Consistency -- This refers to a consistent state before and after a transaction. In the case of data warehousing, the consistent state is not achieved after each load but after the entire set of load operations is complete. But still, after the load, you may still have dirty data conditions as mentioned above. So what does consistency really mean? Consistency has a different use case in this environment.
Isolation -- now this is an interesting point. In some data warehousing environment, people want to read through locks -- the fresher the data the better and a very different requirement than in the OLTP world.
Durability -- obviously once you have loaded you want to make sure it persists, so nothing new here.
One of our design objectives at Infobright is to come up with new approaches to handling the data warehouse problem. We believe that there needs to be some strong discussion around how and if ACID compliance should be applied. So the question is "What does ACID really mean for data warehousing?"