Joinutility seperatorLogin utility separator Infobright.com
   
 
Does denormalization help or hurt query performance?
Posted: 29 August 2008 10:12 AM   Ignore ]  
Jr. Member
Avatar
RankRank
Total Posts:  87
Joined  2008-08-18

This is a common question that we’ve encountered in the past during our PoC process.

Question:

Does denormalization help or hurt query performance?

Answer:

It depends.  A well thought out data model may or may not include denormalization of the data.  However, the most important thing is to denormalize data to meet the needs of the business users.  Often, denormalized data results in fewer tables, fewer joins, and easier understanding by end users.  It does result in additional up front work in the ETL process to ‘merge’ tables prior to loading into the data warehouse.

Denormalized data will result in incremental data storage over a normalized model, since the same value can be stored many times (for example, I may store Mississippi instead of MS for the state code).

[ Edited: 29 August 2008 11:43 AM by John Kemp]
Profile
 
Posted: 01 September 2008 08:25 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  764
Joined  2008-08-18

The vendors who build their database technology on the columnar stores
usually claim that denormalization in such stores leads to better
outcome than in case of the row stores. Our ICE is not an exception
here. This is true especially for the decision support query workloads.
Firstly, adding more columns to a big, denormalized table is less
harmful for the query performance in the columnar stores than in the row
stores, especially for the queries which use only a fraction of all the
columns. Secondly, the columns created during denormalization are
usually expected to have highly repeatable values and, therefore, they
are expected to be easily compressible is the columnar stores.

All this does not mean that denormalization should be always the way to
go in ICE and in other columnar stores. As mentioned in the previous
post, it depends a lot on the business users’ requirements, the expected
ETL overheads, etc. Moreover, it is not always the case that the
above-mentioned benefits are the benefits indeed. Firstly, it is not
always the case that the columns resulting from denormalization can be
really compressed better enough than the original ones, kept in the
nicely normalized model. Secondly, the higher cost of the access to the
bigger, denormalized data may overpower the benefits coming from the
replacement of the multi-table joins with simpler, single-table queries.

To summarize, denormalization is always worth considering and, in
particular situations, it may provide very good results. However,
denormalization is not going to solve all the problems of the world, no
matter whether the columnar stores or row stores are applied. Everything
depends on the data, the queries, and the database used as the choice
between the row- and column-oriented storage is only one of many
database technology aspects.

[ Edited: 08 September 2008 01:14 PM by Jakub Wroblewski]
Signature 
Profile
 
Posted: 10 December 2008 07:54 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  12
Joined  2008-10-02

It is also worth to note that denormalized table does not contain primary and foreign keys which link tables in normalized database, so there are fewer columns in total. Moreover, compression of foreign keys may be difficult, especially when dimension tables are large and foreign keys can take on many different values, so removing these columns may lead to significant gain. This is another ‘pro’ of denormalization.

Certainly, all other columns from dimension tables that are joined into fact table would get bigger and occupy more space, due to replication of rows of dimensions. But as Jakub mentioned, replicated values compress better, so if every row is replicated, say, 10 times, total increase in occupied space is usually much lower than 10-fold, I would rather expect like 2-fold increase. Please note that traditional row-oriented warehouse would always grow by the factor of 10 in such case, even if data are compressed (by rows). This is one of many reasons why column-oriented compression vastly outperforms row-oriented compression; and why denormalization is easier to handle for column-oriented warehouse than for traditional row-based one.

Signature 

Marcin Wojnarski
Infobright Inc.

Profile
 
Posted: 13 March 2009 12:16 PM   Ignore ]   [ # 3 ]  
Newbie
Avatar
Rank
Total Posts:  27
Joined  2008-08-18

I have posted a blog on this topic “Reflections on Data Redundancy and the Shades of Denormalization”.

Best greetings,

Signature 
Profile