
Infobright Blog

by Dan de Grazia     Wed, Apr 25, 2012

I am new to Infobright, but I have been working on the problems we solve for many years. This is the first of several blog posts I plan to write on both business and technical issues, including one on how to get reliable results from a proof-of-concept.

Maybe it is the everyman IT story, but it is so common that it has become part of the background noise of Big Data, particularly where machine-generated data is concerned: the problem of finding a solution to our solution.

Most of us continue doing what we have done in the past until it doesn't work anymore. That seems pretty smart. And it is. However, once it stops working, things get foggy quickly. Big machine data problems are a classic example. Twenty and even ten years ago we were all doing fine; our existing hardware and row-based solutions were able to scale with the data volumes. We were working in a world where Moore's Law kept us out of trouble on the raw power side and disk manufacturers kept innovating on storage systems. We were IT and we were smart and we had money to spend.

For several reasons, such as data retention regulation and Internet access expansion, our money started to run out. More importantly, we never considered that our needs would outstrip our growth curve. Ten years ago we were confident that Moore's Law would hold, and it has. We never thought that our problems could grow faster than that. Well, they did.

Here is where the real opportunity sets in (read: big problems). We naturally think in terms of the tools we already have when looking for solutions to our problems. It is one of my worst habits to go to my toolbox to see what I might have that fixes my problem, or more accurately, that fits the solution I think I need. With Big Data we wanted a solution in the worst way. In many cases this is what we have. Studies show the top responses to big machine data analytics problems are: reduce the amount of data the user has access to, hire more DBAs, and buy bigger, faster hardware. In some cases one quarter of our IT budget goes to storage. We did this knowing we were sinking faster than we were bailing. Same toolbox yields same tools.

Next week I am going to discuss other technologies that attempt to add tools to the toolbox, and why this is so hard to do. I do want to end with one last "new guy" insight from my beginnings at Infobright. Most people come to Infobright having already seen that applying standard thinking to their problem has failed. They now suspect they are looking for a solution to a solution they have already spent considerable time and money on. Infobright is what happens when serious problems are tackled by serious thinkers. In the realm of big machine data this means speed and compression go up and cost goes down. Until Infobright, everyone above was right; you were either small or you were powerful, but you were not both, and low cost was never an option. As we will cover in the next few blog posts, we finally have a solution to our solution. Infobright.


by Jeff Kibler     Wed, Apr 11, 2012

In Infobright 4.0 (ICE and IEE), we optimized our lookup functionality. The 10,000 limit recommendation has been removed.

While the removal of the 10,000 limit recommendation allows larger dimension tables to be flattened into the fact table, one should also consider the ramifications of using 'lookup' columns. Because the lookup is stored uncompressed in memory, you're limited by the amount of resources on the system. If you have a very large table with a very large number of distinct values, the lookup may consume significant RAM. Therefore, I recommend you only use large lookups on columns which are critical and beneficial to you. Do not use 'lookup' on a column just because you can; consider how often you use that column and weigh that against its resource consumption.

In short, use lookups when you can maintain low cardinality (a ratio of total to distinct values of at least 10:1). When the total number of distinct values is extremely large, justify the use of RAM before setting the lookup flag.

To wrap up:
· Have a ratio of total to distinct values of at least 10:1.
· Ensure you have enough RAM to hold all uncompressed, distinct values in memory (without causing other processes/queries to suffer).
· Lookups are only applicable to varchar/char fields. Numbers and dates will be ignored.
· Only consider lookups for columns commonly used in the select, where, and group-by clauses of queries. Rarely used columns only consume RAM.
· Don't use lookups as a general 'surrogate key'; only use lookups when you need them.
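
The checklist above can be sketched as a quick pre-check on a sample of a column's values. This is a minimal illustration in Python, not part of Infobright: the function name `lookup_candidate`, the RAM budget parameter, and the byte estimate (raw UTF-8 size of the distinct values, which ignores any engine overhead) are all assumptions made for the example.

```python
def lookup_candidate(values, ram_budget_bytes, min_ratio=10.0):
    """Rough pre-check for a 'lookup' column (hypothetical helper).

    Returns (ok, ratio, est_bytes), where ratio is total-to-distinct and
    est_bytes is the raw size of the distinct values, a lower bound only.
    """
    non_null = [v for v in values if v is not None]
    # Lookups only apply to char/varchar data; numbers and dates are ignored.
    if not non_null or not all(isinstance(v, str) for v in non_null):
        return (False, 0.0, 0)

    distinct = set(non_null)
    ratio = len(non_null) / len(distinct)
    # Distinct values are held uncompressed in memory, so sum their raw sizes.
    est_bytes = sum(len(v.encode("utf-8")) for v in distinct)

    ok = ratio >= min_ratio and est_bytes <= ram_budget_bytes
    return (ok, ratio, est_bytes)


if __name__ == "__main__":
    country = ["US", "CA"] * 50                 # 100 rows, 2 distinct values
    session = [str(i) for i in range(100)]      # every value is distinct
    print(lookup_candidate(country, ram_budget_bytes=1024))
    print(lookup_candidate(session, ram_budget_bytes=1024))
```

A check like this only covers the ratio, type, and memory points; whether the column is used often enough in queries to earn its RAM is still a judgment call.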

Other things to consider:
· Initial server startup time can be impacted if you have an extremely large number of lookup columns. The server pulls those values off disk and puts them in RAM when you start the service.
· You're automatically taking RAM away from other processes/queries when using lookups.
· You cannot change the DDL to remove or add new lookups. Adding or removing a lookup column requires a full data dump, drop table, create table, and re-load. In the future, we hope to change where 'lookups' are defined, but at least for now, it's a risk.
· DomainExpert™ technology (beginning in 4.0) is a great alternative for any column which doesn't fit the lookup paradigm *and* has a repeatable pattern (ex: e-mail addresses).
· If neither DomainExpert nor lookups fit a column, adding an md5-hash-equivalent column can help with query times on char/varchar columns. More information on MD5 hashing can be found here: http://www.infobright.org/images/uploads/blogs/how-to/How_To_Efficiently_Search_Strings_in_Infobright.pdf
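
As a rough illustration of the md5-hash-equivalent column idea, the sketch below derives an integer key from a string before the data is loaded. It is an assumption-laden example, not the method from the linked how-to: the helper names `md5_key` and `add_hash_column`, the `_md5` column suffix, and the choice to keep only the first 60 bits of the digest (so the key fits a signed 64-bit integer column) are all made up for this example.

```python
import hashlib


def md5_key(text):
    """Map a string to an integer key derived from its MD5 digest.

    Only the first 15 hex digits (60 bits) are kept so the value fits in a
    signed 64-bit integer column; this truncation is an assumption of the
    sketch and makes collisions slightly more likely than full MD5.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest[:15], 16)


def add_hash_column(rows, column):
    """Return copies of the rows with an extra '<column>_md5' integer field."""
    return [dict(row, **{column + "_md5": md5_key(row[column])}) for row in rows]


if __name__ == "__main__":
    rows = [{"email": "a@example.com"}, {"email": "b@example.com"}]
    for row in add_hash_column(rows, "email"):
        print(row)
```

A query would then filter on the integer column first and keep the original string comparison as a second condition, so that a hash collision cannot return the wrong row.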


by Rives     Mon, Apr 09, 2012

We have all witnessed the explosion of data and the challenges it brings. Today organizations are awash with data. When I first started in Information Technology we were talking gigabytes; today it's terabytes, petabytes and even exabytes. Next up: zettabytes.

Healthcare is no different, though its data is harder to manage because it is wrapped in layers of federal and state regulations and stringent safeguards. In the US these include HIPAA, the HITECH Act, FISMA and a litany of other alphabet-soup regulations.

The ability to collect Big Data within healthcare is not the problem; organizations are already doing so, particularly when it comes to log management for compliance. It is the ability to process, store and interpret massive amounts of information that is one of today's most important technological drivers within healthcare.

One of the fundamental problems with log management within many organizations is effectively balancing resources against the torrents of HIPAA log data being generated on a daily basis. The high frequency of log generation is further complicated by the length of time HIPAA requires these files to be retained in order to guard against non-compliance and audit issues. Add to this the need to perform regular, effective analysis of this data. The more you look at log management within healthcare, the clearer both the technical and economic challenges of storing and analyzing terabytes of log information become.

The traditional approach to log management is to store these logs within a row-based database. However, row-based databases are not well suited to managing and storing the surging data volumes required by these regulations. Using them puts IT administrators between a rock and a hard place as they try to mitigate plummeting performance and increasing storage requirements.

The other approaches are to deploy a general-purpose data warehousing solution, an Event Log Management application, or an appliance. While most of these are great solutions, they are often very costly in terms of hardware, licensing, and DBA effort. The high DBA effort is evident in how these solutions address the most costly aspects of managing surging data volumes, which are needless I/O operations and high latency: they require some sort of database tuning.

Also, many of these solutions are best suited for workloads that consist of a high volume of planned, repetitive reports and queries. This approach fails to address the growing need for a data warehouse designed for the ad hoc, investigative analysis that healthcare organizations require to perform effective analysis of their log data.

Infobright provides a more innovative approach for quickly analyzing fast-growing volumes of event data: a purpose-built, self-tuning, column-store analytic database designed to scale. Infobright is optimized for the complex ad hoc analysis required by healthcare organizations today.

Infobright's architecture addresses the limiting factors of traditional databases by minimizing disk I/O, eliminating the need for database tuning, and letting a query run against a single column, thereby limiting the search to relevant data rather than the entire database. Infobright delivers better performance and a greater degree of scalability than traditional approaches, allowing organizations to store and analyze more data for a fraction of the cost, a fraction of the time and a fraction of the hardware requirements of other solutions on the market.

