infobright.org
Joinutility seperatorLogin utility separator Infobright.com

Data Warehousing

29
Dec

Backing Up Infobright

John Kemp's photo
by John Kemp     Mon, Dec 29, 2008

Recently, I had some questions regarding backing up an ICE based DataWarehouse.  It reminded me of a posting that I made to a question posed in the forums.  

Just like any other database solution, there are challenges in backing up the database whilst still supporting the business.  I still remember transactional systems being down for ‘maintenance’ on a nightly basis while the data was being backed.  You don’t really have that opportunity these days - many businesses require 7/24 availability of key systems - but with data warehouses, there are some opportunities to implement a backup strategy.

Most data warehouses still utilize a batch processing window for the uploading of data, be it hourly, nightly, or (increasingly rarely) weekly.  These windows offer an ideal time for the backing up of the database’s data.

Organizations will usually take one of two approaches if they are backing up during the load window:

1. Back up the load files

The load file is backed up and archived.  The data is then loaded into the database. 
This is a quick method and is especially useful if processing windows are tight. 
However, if the data needs to be restored, then you are required to reload all the input files - time consuming and difficult.

2. Back up the target database

Load the input files. 
Backup the target database. 
This can be far more time consuming, depending upon the relative size of the target database to the source files. 
On the plus side, restoring the database is quite a bit less painful than reloading all the input files.

Given Infobright’s high compression rates, approach #2 is the one I prefer, since the backup times can be quite low due to the very high compression rates.

Now, if you are using your data warehouse in a near real time mode, with frequent updates to the data throughout the day, then a different approach will be needed as you don’t have the luxury of downtime to effect your backup.

The mysqldump utility may be used to dump both the schema and contents of tables to backup media.  Be careful when using this approach - mysqldump generates insert statements to replicate the data and this can result in backup files that are 10 to 20 times the size of the Infobright data files!  And you have the problem of not being able to load this data into ICE using the insert statements!!!

The data can be replicated to a backup solution using either a proxy server (mysql’s proxy server should be released in the near future) or using MySQL’s SQL replication (note:  the binary replication will not work).  DML statements can be trapped and redirected to a backup solution, where the database may either be replicated, or the DML activities recorded.  Again, the only DML statement that works in ICE is the LOAD DATA INFILE.

Finally, you can use a select ... into outfile statement to generate backups of the data.  The big advantage here:  the database can remain in use while you are dumping out the data.  The big disadvantage - backups can take a long time, backup data files can be ‘huge’, and the impact on database performance can be significant during the backup process.

Alternately, you can look at open source tools like Zmanda to help manage your backups.

So, what do you do?  I always recommend working with the business to determine what they really need, what they are willing to take on in terms of risk, and what 'inconvenience' they can accept in terms of down time or time to recover.  Then drive out the appropriate strategy and tools required from that.

 

Infobright     Tags:

16
Dec

Has the Star Schema data model’s time passed?

John Kemp's photo
by John Kemp     Tue, Dec 16, 2008

Recently, I posted the following to a question about Star Schema model in a forum:

The star schema was really put forward as a way to support data warehousing in a relational database.  Not that it doesn’t work - one of it’s core benefits is that it is relatively easy to understand.  After all, a driving force in the delivery of DW was that your typical business layperson, using rudimentary query tools, couldn’t or didn’t have time to figure out a complex transactional model.  Heck, even most IT guys had a hard time.  And that drove long lead times to get any information out of an OLTP system.  So the star schema approach was developed and it made life a lot easier.

That’s not to say that the star schema approach is perfect.  I always get hung up on slowly changing dimensions.  Using star schemas, we ended up with Type 2 and Type 3 and so on dimension tables.  And guess what?  Complexity starts to creep back in.

The 'flat file' approach is one that I’ve seen gaining traction.  Why not fully denormalize the model so that everything exists in a single table.  There are some great benefits to this approach: first off, it’s very simple and easily understood since everything is there in one table—I don’t have to join to anything!  And managing slowly changing dimensions is a non-issue, since I don’t have them.

Now, this approach is not without its drawbacks:  in a traditional RDBMS solution, your rows get ‘huge’ and that has an impact on performance (not an issue with a column based DBMS, since you only retrieve the columns you care about).  Storage requirements can be massive (again, not really an issue with Infobright due to our compression technology) since you end up with a lot of repetitive data.

One of the biggest challenges with a fully denormalized model is the ETL process - while you remove the complexity from the data model, you shift it over to the ETL process - you now have to manage multiple tables and ensure that all are consistent prior to writing out that one large row to the database!

My esteemed colleague, David Lutz, wrote the following response:

I agree with you in general, and specifically with Infobright, as one of the original design goals of the star schema, as you noted, was to assist analytical queries in a traditional row-oriented RDBMS.  This doesn’t necessarily apply to column-oriented RDBMS and especially not to Infobright.

However, I would offer a slightly different opinion on slowly changing dimensions.  SCDs don’t go away in a denormalized data model - only the dimension tables do.  And one still needs a dimension table (or similar) in the staging database to build the denormalized fact records in the ETL process, again, as you noted.

Dimensional attributes still change.

If it’s just the value of an existing attribute that changes, that’s not so hard - update the staging table and build the new facts with the updated dimension values.  But adding or deleting dimensional attributes can be exponentially harder to accomplish if they exist in a massive fact table.  Perhaps even impossible for all practical purposes.  This is another advantage of having the dimensional attributes segregated into individual “outrigger” tables.

And administrative updates and maintenance - spelling corrections, incorrect values, ETL bugs, etc. - can be prohibitive to perform if they must be applied to large fact tables.

In the end, though, the benefits might outweigh the costs.  It’s a “big picture” decision that includes administrative, operational and performance consideration.

So, are Star Schema models dead?  Not really, as there are absolutely cases where a modified star schema can be of value, as David has described. 

...and it's really all about good design!

 

Infobright     Tags:

Next Page