
Infobright Blog

Gearing up for OSCON

by Jeff Kibler     Mon, Jul 18, 2011

Next week (July 26-28) marks the kickoff of OSCON, one of the most important gatherings of open-source enthusiasts.  As in years past, Infobright will be showcasing its best-in-class, open-source analytic database.  For those attending, stop by Booth 106.  We will be giving out t-shirts and offering tasty treats.

In the booth this year, we will be discussing:

  • Infobright 4.0 Release
    • Rough Query - Query the in-memory knowledge grid directly to provide a range of potential answers.  Best part is, it's never wrong!
    • Domain Experts - Improve compression and, more importantly, decompression of your commonly used data formats.  Out of the box, 4.0 offers excellent improvements for e-mail address columns in your data.  We also allow you to create your own domain experts.
  • EchoNest Millionsong Dataset - Infobright Open-Source Project
    • Thanks to the EchoNest, Infobright collapsed the million song data structure into ICE.
    • Users can query extremely detailed bits of information about one million songs 
  • Enterprise Edition - Distributed Load Processor and the Hadoop Connector
    • Pull massive amounts of data off Hadoop
    • Spread the compression/knowledge grid creation to other servers to drastically improve load speeds (1 Billion Rows, 2 Terabytes, 1 Hour to load... into 1 table).

Plus, we'll be there to answer any other questions as well as get your feedback.  If you'll be at the show and would like to have a drink at/after the show, send us a shout.  We're at (JavaScript must be enabled to view this email address).

Cheers and see you in Oregon!



On the Use of Rough Query on Machine Generated Data – Areas of Applicability – part 2.

by Adrian Andrei     Tue, Jul 05, 2011
Over the years, Infobright has pioneered technologies to address large data volumes, from its columnar approach to the use of the Knowledge Grid to narrow down query results.

 

In a previous post, a scenario was identified as being suitable for improvement using rough queries. In this post we will illustrate how one can improve the overall response time of the system by leveraging rough queries.

 

The Scenario

A data center monitors the temperature inside its server racks through sensors that generate periodic data bursts. Each burst consists of tuples of timestamp, sensor identifier, and ambient temperature, which are stored in the Infobright engine.
If the temperature rises above 60 degrees (Celsius) an alert is generated and counter-measures are taken to bring the temperature down.

 

The challenge is to make sure the reaction time between the temperature increase and system response is as small as possible.

 

The Existing Implementation

On the test platform described in the previous posting (link here), the query returns a response in around 3 seconds:

mysql> select max(temp) from sensor where ts > 1000000 and id = 1;
+-----------+
| max(temp) |
+-----------+
|        35 |
+-----------+
1 row in set (3.10 sec)
By examining the execution plan we can see that 951 packs were loaded during this execution.

2011-06-23 09:19:26 [2] [...] Total data packs actually loaded (approx.): 951
We can improve on this with a two-step approach: first check whether any sensor reports a higher temperature, and then, if one does, drill down into the specific sensor that caused it.

 

To do that, one can issue the query below to detect a temperature increase:


mysql> select max(temp) from sensor where ts > 1000000;
+-----------+
| max(temp) |
+-----------+
|        36 |
+-----------+
1 row in set (0.13 sec)
In doing so a relatively small number of packs are loaded as the current engine already uses plan optimization at the Knowledge Grid level based on rough query.

2011-06-23 09:20:02 [2] [...] Total data packs actually loaded (approx.): 6
We can immediately improve on this by using a rough query to determine whether any sensor reports a higher temperature, as in the query below:

mysql> select roughly temp from sensor where ts > 1000000;
+------+
| temp |
+------+
|   30 |
|   36 |
+------+
2 rows in set (0.00 sec)
The rough query responds far faster because no actual data is loaded:

2011-06-23 09:20:16 [2] [...] Total data packs actually loaded (approx.): 0
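
To see why a rough query can answer without loading any data, here is a minimal Python sketch of the idea. It assumes a simplified model in which each data pack keeps only a minimum and maximum per column. The `PackStats` class, the `rough_range` function, and the sample values are hypothetical and for illustration only; the real Knowledge Grid holds richer metadata, and this is not Infobright's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PackStats:
    """Per-pack metadata only (hypothetical, simplified): no row data is kept."""
    ts_min: int
    ts_max: int
    temp_min: int
    temp_max: int


def rough_range(packs: List[PackStats], ts_after: int) -> Optional[Tuple[int, int]]:
    """Return (low, high) bounds on temp for rows with ts > ts_after.

    Only pack metadata is consulted. The bounds may be wider than the
    true range, but every matching row is guaranteed to lie inside them.
    """
    # A pack can hold a matching row only if its largest ts exceeds the cutoff.
    candidates = [p for p in packs if p.ts_max > ts_after]
    if not candidates:
        return None
    return (min(p.temp_min for p in candidates),
            max(p.temp_max for p in candidates))


# Made-up pack statistics, not taken from the test platform.
packs = [
    PackStats(ts_min=0, ts_max=999_999, temp_min=29, temp_max=41),
    PackStats(ts_min=1_000_000, ts_max=1_065_535, temp_min=30, temp_max=35),
    PackStats(ts_min=1_065_536, ts_max=1_131_071, temp_min=31, temp_max=36),
]
print(rough_range(packs, 1_000_000))  # prints (30, 36)
```

The first pack is skipped because none of its timestamps can exceed the cutoff, so its wider temperature range never influences the answer. The result is a pair of bounds rather than an exact value, which is the trade the rough query makes for its speed.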
Further improvements can be made once we examine the behaviour when an incident occurs.

 

The Incident

After loading data in which sensor 1 reports a temperature increase, the above scenario reports the issue in a little over 3 seconds:

mysql>  select roughly temp from sensor where ts > 1000000;
+------+
| temp |
+------+
|   30 |
|   88 |
+------+
2 rows in set (0.00 sec)
Since we now know there is a temperature spike, we need to determine if sensor 1 is the cause of it:

mysql> select max(temp) from sensor where ts > 1000000 and id = 1;
+-----------+
| max(temp) |
+-----------+
|        88 |
+-----------+
1 row in set (3.03 sec)
The response time is this long because 498 data packs were loaded during the exact query execution:

2011-06-23 09:52:03 [2] [...] Total data packs actually loaded (approx.): 498
In the next part we will see how we can further improve and significantly reduce the reaction time by using rough queries.

 

The Rough Query Based Solution

Once we detect that at least one sensor reports an out-of-bounds temperature, instead of immediately issuing the exact query we can leverage the fact that the timestamps are almost monotonic in value and try to narrow down the range in which the incident happened:

mysql> select roughly ts from sensor where ts > 1000000 and temp > 60;
+----------+
| ts       |
+----------+
| 13632490 |
| 13659115 |
+----------+
2 rows in set (0.00 sec)
The rough query response is extremely fast as no data is loaded during evaluation, and we now know the upper and lower bound of the timestamp when the temperature increased above the monitored threshold.

2011-06-23 09:55:18 [2] [...] Total data packs actually loaded (approx.): 0
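
The sketch below illustrates, under a simplified model, how timestamp bounds like these could be derived from metadata alone: packs whose maximum temperature cannot exceed the threshold are ruled out, and the timestamp ranges of the remaining "suspect" packs give the bounds. The `Pack` type, the `rough_ts_bounds` function, and the sample values are hypothetical names and data chosen for illustration, not Infobright internals.

```python
from typing import List, NamedTuple, Optional, Tuple


class Pack(NamedTuple):
    """Hypothetical per-pack metadata: ts range and max temp only."""
    ts_min: int
    ts_max: int
    temp_max: int


def rough_ts_bounds(packs: List[Pack], ts_after: int,
                    temp_above: int) -> Optional[Tuple[int, int]]:
    """Bound the ts of rows with ts > ts_after and temp > temp_above
    without reading any rows."""
    # Keep only packs whose metadata cannot rule out a matching row.
    suspects = [p for p in packs
                if p.ts_max > ts_after and p.temp_max > temp_above]
    if not suspects:
        return None
    # Matching rows also satisfy ts > ts_after, so the lower bound
    # never needs to drop below the cutoff itself.
    low = max(ts_after, min(p.ts_min for p in suspects))
    high = max(p.ts_max for p in suspects)
    return low, high


# Made-up packs with almost monotonic timestamps; only the third holds a spike.
packs = [
    Pack(ts_min=0, ts_max=65_535, temp_max=40),
    Pack(ts_min=65_536, ts_max=131_071, temp_max=38),
    Pack(ts_min=131_072, ts_max=196_607, temp_max=88),
    Pack(ts_min=196_608, ts_max=262_143, temp_max=35),
]
print(rough_ts_bounds(packs, 100_000, 60))  # prints (131072, 196607)
```

Because timestamps arrive in nearly increasing order, each pack covers a narrow, mostly non-overlapping timestamp range, which is what makes the returned bounds tight enough to be useful.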
The response gives us a smaller timestamp range in which the incident occurred, so we can now drill down with the exact query and exclude intervals not relevant to our search. Replacing the initial threshold of 1000000 with the lower rough bound of 13632490, we can rewrite the query as follows:

mysql> select max(temp) from sensor where ts > 13632490 and id = 1;
+-----------+
| max(temp) |
+-----------+
|        88 |
+-----------+
1 row in set (0.07 sec)
The response time is now a fraction of the initial response time because significantly fewer data packs are loaded:

2011-06-23 09:56:35 [2] [...] Total data packs actually loaded (approx.): 3
Using rough queries, we have reduced the time needed to trigger an alert from about 3 seconds to a near-real-time 0.07 seconds.
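
The three steps above can be strung together in a small monitoring routine. The following is a minimal Python sketch, assuming a hypothetical `run_query` callable that sends SQL to the Infobright server and returns rows as tuples; the function names and the `replay_post_results` stand-in (which simply replays the outputs shown in this post) are illustrative, not part of any Infobright API.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical query runner: takes SQL text, returns rows as tuples.
QueryFn = Callable[[str], List[Tuple[int, ...]]]

THRESHOLD = 60       # alert threshold in degrees Celsius (from the scenario)
START_TS = 1000000   # initial timestamp cutoff used throughout the post


def check_sensor(run_query: QueryFn, sensor_id: int) -> Optional[int]:
    """Return the sensor's max temperature if it exceeds THRESHOLD, else None."""
    # Step 1: rough check across all sensors; no data packs are loaded.
    rough_temp = run_query(
        f"select roughly temp from sensor where ts > {START_TS}")
    if not rough_temp or max(r[0] for r in rough_temp) <= THRESHOLD:
        return None  # the rough upper bound rules out an incident

    # Step 2: rough bounds on when the threshold may have been exceeded.
    rough_ts = run_query(
        f"select roughly ts from sensor "
        f"where ts > {START_TS} and temp > {THRESHOLD}")
    if rough_ts:
        # `>=` keeps a reading that sits exactly on the rough lower bound.
        ts_filter = f"ts >= {min(r[0] for r in rough_ts)}"
    else:
        ts_filter = f"ts > {START_TS}"  # no narrowing available

    # Step 3: exact query over the narrowed range only.
    exact = run_query(
        f"select max(temp) from sensor where {ts_filter} and id = {sensor_id}")
    peak = exact[0][0] if exact else None
    return peak if peak is not None and peak > THRESHOLD else None


def replay_post_results(sql: str) -> List[Tuple[int, ...]]:
    """Stand-in for a database call; replays the outputs shown in the post."""
    if sql.startswith("select roughly temp"):
        return [(30,), (88,)]
    if sql.startswith("select roughly ts"):
        return [(13632490,), (13659115,)]
    return [(88,)]


print(check_sensor(replay_post_results, 1))  # prints 88
```

In the common case where nothing is wrong, the routine returns after a single rough query, so the expensive exact query is only ever issued once an incident is already suspected.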

 

The Conclusion

By leveraging knowledge of the data distribution and the rough query capabilities of the Infobright engine, it is possible to obtain spectacular reductions in response time. Rough queries are not a substitute for exact queries. They are a tool that any data warehouse analyst can use to explore further correlations in the data, at response times that make many more scenarios on large semi-structured datasets feasible than traditional exact approaches allow.
Infobright     Tags: example, rough+query
