Joinutility seperatorLogin utility separator Infobright.com

Infobright Blog

21
Oct

Extract Transform Load - What Not To Do…From Someone Who has Done Them All!!!

John Kemp's photo
by John Kemp     Tue, Oct 21, 2008

So, in giving further thought to ETL tools, I though it worthwhile to revisit some of the things that I've done before that you really shouldn't do.  These were learned in a long, hard, and painful way.  So, an opportunity for you to learn from the errors of my ways!  Avoiding these 'Don'ts' can help you implement your ETL process more quickly and relatively painlessly (sounds like my doctor before he gives me a shot).

 

  • Don't Assume It Will be Easy...

 

...because it won't be!  Even the simplest of ETL processes can be quite complex when you factor in data dependencies, operational hardening, and the like.  One of the biggest mistakes made in a DW project is underestimating the amount of time that will be required to complete the ETL process.  I still remember getting the transformations completed on time, only to think through a whole pile of what if scenarios, that resulted in a doubling of the time required to actually finish the ETL process, and putting the project 2 months late!  Not a lot of fun!
  • Don't Assume You Won't Have Operational Issues...
...you will!  I remember an ETL process failing because someone took a router out of service - unfortunately, none of us found out about it until we determined that the DW had not been updated.  Ensure that whatever ETL process you put into place generates notifications on any break or fail in the process, so that corrective action can be implemented.  Most ETL tools today provide this functionality.  
  • Don't Assume You Don't Need Business Input
We've all made the mistake of going to the business, getting the transformation requirements, and then bolting from them to get work done.  After all, we don't want them interfering with our critical development work!  In reality, you generate a far better ETL product by involving the business every step of the way.  A good business user can help define escalations on errors, error handling, notifications, and can help identify additional transformations that are required - this helps avoid expensive and time consuming rework when issues are encountered.
  • Don't Fix Quality Issues in the ETL
I'm not talking about determining householding or reconciling addresses across multiple systems here.  I am talking about fixing blatantly incorrect information in the source system.  You know the stuff; it's when Fred in HR asks you to correct those incorrect province codes coming out of some antiquated HR solution -- this is a simple example, but when you think about making more extensive changes to data, you can run into a number of problems: first off, you change wrong data to another type of wrong data; secondly, assumptions you make today may not hold true down the road, resulting in incorrect data!  And in my experience, you often spend as much time 'painting over' the data defects as you would fixing the source problem in the first place!
  • Don't Underestimate the Value of an ETL Tool
Hey, you've only got one feed your bringing in daily.  And it's pretty much a 1 to 1 transformation.   Dimensions are once a month at best.  So, why bother with an ETL tool?  I can bang out the code quickly without having to install the ETL tool and we're done!  Well, you still have hardening.  And some other poor schmuck may have to try and understand what you've written when you're no longer there. And guess what, you build a DW that users love...and want more.  And now you've got a whole pile of loads and transformations to do - so you keep adding them.  And then you end up with an unsupportable mess.  I suggest taking a good hard look at the ETL tools out there and using one unless you are absolutely sure that what you've got is what you're going to have!  The Open Source offerings have come forward a big way in their functionality and can help out.  ETL tools give you documentation, built in hardening, error handling, and monitoring. Do yourself a favor and look at them.
  • Don't Assume that You Know Your Data
Hey, if you are like me, you probably do know what you're data structures are.  However, you can't be sure what others are putting into it (of course, unless you are a one person shop).  More and more, we're utilizing tools like XML to bring in data from other systems, often without knowing if the data is correct!  Do yourself a favour: don't assume you know the data; assume you don't and treat your transformations and loads in the same way - it will save you grief down the road!
 

 In my next blog: enough about the Don'ts, and on to the Do's

Infobright     Tags: data, dw, etl, source

Maybe this is more specific advice, but I think it will be helpful to those interested:

1) Profile your source data. Often using   any good profiling tool will quickly allow you to zero in on out of spec data.

2) Often, creating tables that mirror the source files can help in profiling the source data—and using a database as a source can allow you to quickly test your jobs to ensure accurate transforms without getting stuck in a 3 hour processing run and finding a bug 2 1/2 hours in.

Date: 11/11/08

there is a good book: “The Data Warehouse ETL Toolkit” from Ralph Kimball and Joe Caserta. ISBN=0-7645-6757-8. it explains all around ETL with practical examples and problem solving patterns.

Author: uwegeercken
Date: 11/07/08

Please login or register to post a comment.