infobright.org

Data Warehousing

05
Jan

The Culture of Open Source

by David Lutz     Mon, Jan 05, 2009

Not necessarily a data warehousing topic, per se, but it is relevant to Infobright as an open source database.

An interesting read on the open source culture and community - "The Pirate's Dilemma".

http://www.amazon.com/Pirates-Dilemma-Culture-Reinventing-Capitalism/dp/1416532188/ref=sr_1_27?ie=UTF8&s=books&qid=1231168447&sr=1-27


01
Jan

Agile for Data Warehouse Projects?

by Victoria Eastwood     Thu, Jan 01, 2009

I’m a bit of an old war horse when it comes to project management. I have managed many projects over the years and, as my boss would be happy to point out, have plenty of grey hair to show for it.

When I first started my career, projects were managed using the waterfall life cycle, some more successfully than others, of course. I got quite good at it, to tell you the truth. As the practice of software engineering improved, iterative development approaches were introduced, which I quickly applied to my projects, following the Microsoft mantra of “get to a shippable state and keep there”: add function and features iteratively once a skeleton framework is established.

We apply Agile methods to our development process at Infobright. Applying Agile to the development of a database system is a bit of a struggle: the problems tend to be fairly large and rarely fit in one iteration, so defining appropriate test points at the end of an iteration becomes a bit of a challenge. However, the benefits of Agile outweigh many of these struggles, and we have been quite successful with its application.

Then I got to thinking about the various data warehouse projects I have worked on. Most of these were mammoth projects, both in their objectives and in the number of people involved. All followed some sort of waterfall approach to project management, and all had issues around changing requirements, unknowns in source system data, etc. All of them suffered from date slippages and failed features that had to be re-implemented. Just for the record, I wasn’t managing the projects but working on requirements gathering and data warehouse architecture and design.

If you look at Agile and think carefully about what development problems it solves, it seems to me it’s a perfect fit for data warehousing applications. Each iteration can bring one more bit of data or subject area into the warehouse. The business can prioritize its data and analysis needs, and it can change them every month! The team can apply what it learns about the data over time, rather than waiting until it’s completely implemented before figuring out that the data quality is insufficient to meet the business requirement.

So what’s different about Agile relative to other iterative development approaches? Well, several seemingly minor things, but they add up. Here are a few of the more obvious ones.

  • Create a list of business functionality to be delivered by the data warehouse project and have the business prioritize it.
  • At the beginning of an iteration, select from the backlog of business functionality the items that deliver the most value to the business and can be completed in a month (if items are too big for a month, break them down into several smaller functional items).
  • Time box the iteration to about a month (we follow calendar months).
  • Leave the development team alone during the iteration (i.e. don’t change the task list). If something disastrous happens, scrap the iteration and start over.
  • The development team should be cross functional, including developers (ETL and BI), QA, business analysts, etc. Keep the team small, around 7 or so. Have the team members sign up for tasks from the task list. Watch magic happen as the team solves its problems to meet the iteration goals.
  • At the end of the iteration, demonstrate the delivered functionality. The software should be shippable, so the business can choose if they would like to implement it into production.
  • Repeat the process, starting by re-prioritizing the backlog of business items and selecting the tasks for the next iteration.
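
The backlog selection step can be illustrated with a small sketch. This is only one possible reading of the bullets, not a prescribed Agile algorithm: the `BacklogItem` fields, the numeric value scores, and the day-based capacity are all assumptions made for illustration. It picks the highest-value items first and skips anything that no longer fits in the time box.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    # All three fields are hypothetical; the business would supply the values.
    name: str
    business_value: int  # higher means more valuable to the business
    effort_days: int     # team estimate of the work involved


def plan_iteration(backlog, capacity_days):
    """Pick items in descending business value that still fit the time box."""
    selected = []
    remaining = capacity_days
    for item in sorted(backlog, key=lambda i: i.business_value, reverse=True):
        if item.effort_days <= remaining:
            selected.append(item)
            remaining -= item.effort_days
    return selected


backlog = [
    BacklogItem("Customer subject area", 8, 15),
    BacklogItem("Returns data feed", 5, 10),
    BacklogItem("Sales subject area", 9, 12),
    BacklogItem("Store hierarchy cleanup", 3, 4),
]
for item in plan_iteration(backlog, capacity_days=30):
    print(item.name)
```

Items that do not fit stay on the backlog for the next round of prioritization; an item too large for any iteration would need to be broken down first.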


Of course, there is much more to Agile than I can put into a few bullet points. There are many resources on the web that can help you get started with Agile if you are interested (see http://agilemanifesto.org/ and http://www.agilealliance.org). I just find it curious that data warehouse projects don’t seem to follow the best practices in software engineering. Perhaps people don’t think of data warehousing as software development, since you generally implement using tools (but this is really just another type of programming, isn’t it?).


29
Dec

Backing Up Infobright

by John Kemp     Mon, Dec 29, 2008

Recently, I had some questions about backing up an ICE-based data warehouse. They reminded me of a reply I posted to a question in the forums.

Just like any other database solution, there are challenges in backing up the database whilst still supporting the business.  I still remember transactional systems being down for ‘maintenance’ on a nightly basis while the data was being backed up.  You don’t really have that opportunity these days - many businesses require 24/7 availability of key systems - but with data warehouses, there are some opportunities to implement a backup strategy.

Most data warehouses still utilize a batch processing window for the uploading of data, be it hourly, nightly, or (increasingly rarely) weekly.  These windows offer an ideal time for the backing up of the database’s data.

Organizations will usually take one of two approaches if they are backing up during the load window:

1. Back up the load files

The load file is backed up and archived.  The data is then loaded into the database. 
This is a quick method and is especially useful if processing windows are tight. 
However, if the data needs to be restored, then you are required to reload all the input files - time consuming and difficult.

2. Back up the target database

Load the input files. 
Back up the target database. 
This can be far more time consuming, depending upon the relative size of the target database to the source files. 
On the plus side, restoring the database is quite a bit less painful than reloading all the input files.

Given Infobright’s high compression rates, approach #2 is the one I prefer, since the backup times can be quite low.
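
For comparison, approach #1 can be as simple as archiving the input files before each load. Here is a minimal sketch using Python's standard `tarfile` module; the function names and the file names in the usage note are hypothetical, and nothing in it is specific to Infobright.

```python
import tarfile
from pathlib import Path


def archive_load_files(load_files, archive_path):
    """Write the given load files into one gzip-compressed tar archive."""
    archive_path = Path(archive_path)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for load_file in load_files:
            load_file = Path(load_file)
            # Store only the file name so the archive does not depend on
            # the directory layout of the machine that produced it.
            tar.add(str(load_file), arcname=load_file.name)
    return archive_path


def archived_names(archive_path):
    """List the file names stored in the archive, sorted alphabetically."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(tar.getnames())
```

A nightly job might call `archive_load_files(["sales.csv", "returns.csv"], "loads_20081229.tar.gz")` before the load starts. Restoring still means reloading every archived file, which is the drawback noted for approach #1.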

Now, if you are using your data warehouse in a near real time mode, with frequent updates to the data throughout the day, then a different approach will be needed as you don’t have the luxury of downtime to effect your backup.

The mysqldump utility may be used to dump both the schema and contents of tables to backup media.  Be careful when using this approach - mysqldump generates INSERT statements to replicate the data, and this can result in backup files that are 10 to 20 times the size of the Infobright data files!  You also have the problem of not being able to load this data back into ICE using the INSERT statements!
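
To get a feel for what that ratio means for backup media, here is a back-of-the-envelope sketch. The 10 to 20 times range comes from the paragraph above; the 50 GB figure in the usage note is purely hypothetical, and actual ratios will depend on your data.

```python
def estimate_dump_size_gb(infobright_data_gb, low_factor=10, high_factor=20):
    """Return a (low, high) estimate in GB for a mysqldump of the same data.

    The default factors are the rough 10x to 20x range quoted in the post,
    not measured values.
    """
    return (infobright_data_gb * low_factor, infobright_data_gb * high_factor)


low, high = estimate_dump_size_gb(50)
print(f"Expect roughly {low} to {high} GB of dump files")
```

With 50 GB of Infobright data files, that works out to somewhere between 500 GB and 1 TB of dump output.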

The data can be replicated to a backup solution using either a proxy server (MySQL’s proxy server should be released in the near future) or MySQL’s SQL replication (note:  binary replication will not work).  DML statements can be trapped and redirected to a backup solution, where the database may either be replicated or the DML activities recorded.  Again, the only DML statement that works in ICE is LOAD DATA INFILE.

Finally, you can use a SELECT ... INTO OUTFILE statement to generate backups of the data.  The big advantage here:  the database can remain in use while you are dumping out the data.  The big disadvantages:  backups can take a long time, backup data files can be ‘huge’, and the impact on database performance can be significant during the backup process.
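
As a rough illustration, the sketch below builds a matching pair of export and reload statements for one table. The helper names, table name, and file path are hypothetical, and the `FIELDS` clauses are standard MySQL syntax that I am assuming your ICE version accepts, so check them before relying on this. The helpers do no escaping, so only pass trusted table names and paths.

```python
# Delimiter options shared by the export and the reload so the two match.
FIELD_OPTIONS = "FIELDS TERMINATED BY ',' ENCLOSED BY '\"'"


def export_statement(table, outfile):
    """Build a SELECT ... INTO OUTFILE statement that dumps one table."""
    return f"SELECT * FROM {table} INTO OUTFILE '{outfile}' {FIELD_OPTIONS}"


def reload_statement(table, infile):
    """Build the LOAD DATA INFILE statement that restores the same dump."""
    return f"LOAD DATA INFILE '{infile}' INTO TABLE {table} {FIELD_OPTIONS}"


print(export_statement("sales", "/tmp/sales.csv"))
print(reload_statement("sales", "/tmp/sales.csv"))
```

Generating both statements from the same options keeps the dump in a format that LOAD DATA INFILE can read back, which matters because that is the only DML statement ICE accepts.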

Alternatively, you can look at open source tools like Zmanda to help manage your backups.

So, what do you do?  I always recommend working with the business to determine what they really need, what they are willing to take on in terms of risk, and what 'inconvenience' they can accept in terms of down time or time to recover.  Then derive the appropriate strategy and the tools it requires from that.

 

