
Infobright Blog


Data Quality in Big Data

by Jeff Kibler, Fri, Mar 04, 2011

Earlier today on Twitter, Gartner analyst Ted Friedman tweeted:

"[...] is great, but want to hear about addressing [...] in Big Data. Rather have GB of trusted data than a PB of crap"

Couldn't agree more. Regardless of the data size, you're flying blind if the data doesn't actually represent reality. There's a wonderful website that showcases the tragedies of poor data quality in practice: http://www.iqtrainwrecks.com/. These incidents, along with many other unreported problems, occur quite frequently. It's up to the data manager to appreciate and avoid such mistakes.


As for Infobright, why does this matter? Simply put, Infobright, just like MySQL and other relational databases, is designed to handle your structured data, but it doesn't make any assumptions about what that data means. For example, if you declare the "age" field as an integer, it won't complain when you submit a negative value. It also won't detect when a name is misspelled, an address is wrong, or the data is incomplete. It's up to the due diligence of the data owner (which, I argue, is everyone) to ensure data quality before, during, and after the database. While constraints can help, the database should never be the full extent of your data quality protection. It's an end-to-end requirement.
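To make the "before the database" part concrete, here is a minimal sketch of a pre-load check. It assumes incoming rows arrive as Python dictionaries with hypothetical `name` and `age` fields; the field names and the `validate_row` helper are illustrations only, not anything Infobright provides.

```python
from typing import Any

REQUIRED_FIELDS = ("name", "age")  # hypothetical schema, for illustration only


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of data quality problems found in one input row."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Plausibility: an integer column accepts -5, but an age cannot be negative.
    age = row.get("age")
    if isinstance(age, int) and age < 0:
        problems.append("negative age")
    return problems


rows = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": -5},
]
for row in rows:
    print(row, validate_row(row))
```

A check like this catches only what you tell it to look for. A misspelled name or a wrong address would still pass, which is why it complements rather than replaces review by the people who own the data.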

When dealing with big data, it's even more important. Granted, you may only use the data to point you in a direction, but many view their data as sacred. If you're using the data to drive your business in any direction, enact proper data quality checks and verifications throughout the ETL, database, and retrieval processes. For example, if you have a billion rows of input, did you verify that you have a billion rows in your table? When joining two tables, are the join columns at least the same type?
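Both of those questions can be asked in code. The sketch below uses Python's built-in `sqlite3` module as a stand-in database so that it runs anywhere. The `people` and `orders` tables and the two helper functions are hypothetical, and the type lookup relies on a SQLite-specific `PRAGMA`. On MySQL the analogous lookup would go through `information_schema.columns`; whether that carries over unchanged to Infobright is an assumption here.

```python
import sqlite3


def loaded_row_count(conn: sqlite3.Connection, table: str) -> int:
    """Count the rows that actually landed in the table."""
    # The table name is interpolated directly, so pass only trusted names.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def column_type(conn: sqlite3.Connection, table: str, column: str) -> str:
    """Look up a column's declared type (SQLite-specific PRAGMA)."""
    for _cid, name, decl_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        if name == column:
            return decl_type.upper()
    raise KeyError(f"{table}.{column} not found")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (person_id TEXT, total REAL)")

source_rows = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
conn.executemany("INSERT INTO people VALUES (?, ?)", source_rows)

# Check 1: did every input row make it into the table?
counts_match = loaded_row_count(conn, "people") == len(source_rows)

# Check 2: are the join columns declared with the same type?
types_match = column_type(conn, "people", "id") == column_type(conn, "orders", "person_id")

print("row counts match:", counts_match)
print("join key types match:", types_match)
```

The two join columns are deliberately declared with different types (`INTEGER` versus `TEXT`), so the second check fails. That is the kind of mismatch worth catching before a join quietly returns fewer rows than expected.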

Companies love to tout their data size. Infobright loves to help those companies mine their data. But their data means nothing unless it's of high quality. A gallon of seawater does nothing for someone who's thirsty; it'll only get him in trouble.
