There was a recent blog posting that pointed out that ICE and IEE for that matter, are not ACID compliant. Which got me thinking. My first thought was about my failed attempt to follow my father's footsteps and become a chemist. After some initial attempts at experiments, I realized that I was more like my mother -- couldn't measure anything if my life depended on it. Making me a cook not a chemist. But I digress...
My second thought was, what is really required for data warehousing?
Certainly, on the various data warehousing implementation I've worked on, no one cared about these features and in fact referential integrity features were "turned off". Two reasons, as I recall, first the data was not very clean and second it slowed loading down considerably.
On the first note about dirty data, this is a common problem in data warehousing. One hopes that their source systems ensure the data is clean but this is rarely the case (with or without ACID). Fact table records that don't match dimension tables is a common problem and leave the user screaming for LEFT OUTER JOIN.
On loading, in most cases, the batch window is extremely strained with little time to complete the daily work load. Since logging and referential integrity checks reduce the efficiency of load; most people turn it off.
Let's look at ACID a bit more closely.
Atomicity -- this refers to the all or nothingness of a transaction. In an OLTP system, a group of changes over several tables needs to be either all successfully applied or all not. In data warehousing, generally you are loadinga bunch of records in one table. Basically, you want to ensure that you know the state of what was loaded if there is a failure but there are many scenarios where you would want it to partially load. So you know where to restart the load and ensure what ever was loaded and committed is correct. This is a slightly different use case. Single table at a time. Task is to get all the data loaded and if it happens in batches as long as you know the state at any moment everything is cool.
Consistency -- This refers to a consistent state before and after a transaction. In the case of data warehousing, the consistent state is not achieved after each load but after the entire set of load operations is complete. But still, after the load, you may still have dirty data conditions as mentioned above. So what does consistency really mean? Consistency has a different use case in this environment.
Isolation -- now this is an interesting point. In some data warehousing environment, people want to read through locks -- the fresher the data the better and a very different requirement than in the OLTP world.
Durability -- obviously once you have loaded you want to make sure it persists, so nothing new here.
One of our design objectives at Infobright is to come up with new approaches to handling the data warehouse problem. We believe that there needs to be some strong discussion around how and if ACID compliance should be applied. So the question is "What does ACID really mean for data warehousing?"
Test
Post Comment