Having wide, fat columns typically outperform star schemas. When condensing star schemas, one may experience performance drops with varchar/char fields for those ‘dimension’ columns in this wide, fat table. Infobright’s ‘lookup’ functionality can assist you in optimizing performance. By hashing UNIQUE values of the lookup column in memory, we can achieve higher compression as well as faster query performance. We see these results as we are actually using an integer underneath the hood. In the past, we had stricter limitations surrounding lookups; starting in Infobright 4.x, those limitations have been loosened.
While limitations may be loosened, one should also consider the ramifications of using 'lookup' columns. As the lookup is stored uncompressed in memory, you're limited by the amount of resources on the system. If you have a very large table with a very large number of distinct values, you may consume significant RAM resources with this lookup. Therefore I recommend you only utilize large lookups on columns which are critical and beneficial to you. Do not 'lookup' a table just because you can; consider how often you use that column and compare with resource consumption.
In short, use lookups when you can maintain low cardinality (>= 10:1 ratio of total-to-distinct values). When the total number of distinct values is extremely large, justify the use of RAM before setting the lookup flag.
To wrap up:
• Have a >= 10:1 ratio of total-to-distinct
• Ensure you have enough RAM to hold all uncompressed, distinct values in memory (without causing other processes/queries to suffer)
• Lookups are only applicable to varchar/char fields. Numbers and Dates will be ignored.
• Only consider lookups for commonly used columns in the select, where, and group-by clauses of queries. Rarely used columns only suck up RAM usage.
• Don't use lookups as a general 'surrogate key'; only use lookup when you need to use it.
Other things to consider:
• Initial Server Startup Time can be impacted if you have an extremely large number of lookup columns. It's pulling those values off disk and putting them in RAM when you start the service.
• You're automatically taking RAM away from other processes/queries when using Lookups
• You cannot change the DDL to remove or add new lookups. It requires a full data dump, drop table, create table, and re-load in order to add/remove lookup columns. In the future, we hope to change where 'lookups' are defined, but at least for now, it's a risk.
• DomainExpert™ technology (beginning in 4.0) is a great alternative for any column which doesn't fit the lookup paradigm *and* has a repeatable pattern (ex: e-mail addresses).
• If DomainExpert and Lookups do not qualify, adding an md5-hash-equivalent column can help with query times on char/varchar columns. More information on MD5 hashing can be found here: http://www.infobright.org/images/uploads/blogs/how-to/How_To_Efficiently_Search_Strings_in_Infobright.pdf
Read Comments (2)
It has been a little while since I have had the time to sit and write again and I can admit that I quite miss it. From a community standpoint, myself and the intern team have been really busy with getting the new websites in place. We also have alot plans for some really awesome projects being planned with some other open source companies like Akiban, as well as some Social Media contests. Recently though, I have been handling a fair amount of requests dealing with common issues, ranging from simple connections to permission issues. As a general rule, approaching these issues, there is a common few steps that are taken to help eliminate where the issue lies.
On Unix/Linux:
On windows:
This is a very common issue because a default installation of Infobright only allows the root user access to the instance from the machine itself. Granting the remote access is very common usage, though not suggested for the root user becuase it is not good practice to allow root access except from the machine itself, as a security protocol. Stay tuned next week for some more exciting news concerning the Infobright Community....
An infographic is a good way to explain something in a very visual way. As many of you know, we do a number of things to help people make sense of the crazy database world. Here is an infographic on Big Data that helps explain row vs column vs NoSQL. Be sure to click the right arrows to see more info. Take a look!