Infobright is built upon the MySQL architecture and, although we typically recommend the Infobright Loader (and/or the Distributed Load Processor, DLP, with Infobright Enterprise Edition), frequently questions arise as to what connectivity tools can be used in different circumstances for data access and efficient insertion of data into Infobright. As the MySQL website states:
"MySQL provides connectivity for client applications developed in the Java programming language through a JDBC driver, which is called MySQL Connector/J. MySQL Connector/J is a JDBC Type 4 driver. Different versions are available that are compatible with the JDBC 3.0 and JDBC 4.0 specifications. The Type 4 designation means that the driver is pure-Java implementation of the MySQL protocol and does not rely on the MySQL client libraries."
We recently had a customer who asked us a question about fetching larger sets of data via the MySQL JDBC driver for insertion into Infobright. They told us that this driver worked fine for smaller data sets but their business use case required handling at least one million rows of data.
Their specific inquiry was whether they could specify a larger fetch size of at least one thousand records. Alternatively, they asked us if we could provide a JDBC driver other than the MySQL driver or if we could recommend some other solution.
Initially, we told him that there were a few options: looking at alternate drivers that could be used such as Connector/J, Datadirect JDBC, and even the Drizzle JDBC Driver; streaming the data using JDBC (as opposed to Buffering); or, moving the Java service to another server to avoid memory contention. We also specified that there are known issues with JDBC and memory and that there was a good write-up about the JDBC driver from one of our former sales engineers which could be read at the following link:
http://www.infobright.org/Blog/Entry/infobright_mysql_and_jdbc/
However, this customer was emphatic about getting their Java application to connect to Infobright using the JDBC driver; specifically, J/Connector 5.1.18. A couple of weeks after their initial inquiry, they specified that they tested setting the fetch size to 1000, 100 or 10 (the latter containing a very large result set) and, in all three cases, getting an "out-of-memory" exception error.
They found a workaround to avoid the exceptions. This workaround is based on setting the fetch size to INTEGER.MIN; thereafter, the JDBC driver returns 1 row at a time from the existing result set. This works well as long as the result set is small; but this would be a concern if the result set is really large.
In response to this, I did some research on the INTEGER.MIN setting and whether there was a better "streaming" option. After my research was completed, I responded with the following:
"...the main options for avoiding memory contention between the database and JDBC connections are to either stream the results to the application rather than buffer, or isolate the Java service onto a separate VM. If you end up having a large result set, simply moving the Java service to another server has worked consistently so make sure to do that."
After further review, the problem was not inherently in the JDBC driver, but in the configuration.
A review of the issue can be found here: http://bugs.mysql.com/bug.php?id=18148
By default, MySQL + the Connector/J JDBC driver for MySQL will return the entire result set at once, filling the JVM memory even before stepping through the results set.
ResultSet rs=stmt.execute(queryString);
The variable "rs" now contains the entire result set at once. According to the JDBC API, setting the fetch size will provide a hint to the driver and database to return the results in blocks based on the fetch size so that you do not need to hold the entire result set in memory at once.
The work-around recommended was to setFetchSize(INTEGER.MIN) which will force the results to stream one row at a time. This impacts performance due to network latency for each fetch. It's better to retrieve in reasonably sized chunks to avoid network latency.
However, we found that an additional configuration setting is available and, if you use it, the JDBC driver will perform as expected and use the setFetchSize as intended by the JDBC API.
If you have Infobright and the MySQL Connector/J Driver 5.0.1 or higher this will work, but you need to add a setting in the JDBC URL when establishing the DB connection.
jdbc:mysql://SERVER:PORT/DBNAME?useCursorFetch=true&defaultFetchSize=1000
This tells MySQL to use a cursor to step through the results and sets the connection with a default fetch size. After that, you can actually use the setFetchSize and reset it as needed, or simply use the default you specify at the time the connection is made.
On a similar note, a different customer was experiencing very slow bulk load rates into a specific Infobright table which he described as follows:
"What we basically do is read the data from Oracle, and then load it to MySQL in bulk sets of 50,000 records. We are using 'ojdbc6.jar' to read from Oracle, and 'mysql-connector-java-5.1.18-bin.jar' to write to MySQL. We would appreciate any suggestions you might have to improve the loading time."
Initially, I thought that this customer could benefit from the first customer's experience in setting the default Fetch Size higher. However, this latter customer came back to me and explained that the problem was the insertion speeds into Infobright, not the fetch speeds from Oracle. Again, after researching why loading with the MySQL Connector/J might be slow, I came upon the following three recommendations:
cacheServerConfiguration=true
useLocalSessionState=true
rewriteBatchedStatements=true
The customer tried this last suggestion first and, believe it or not, it made a significant difference in load performance! Here is an explanation of how they tested this recommendation and analyzed the results:
"We were able to use the 'rewriteBatchedStatements=true' configuration with a smaller bulk set of 9000 records (originally we had tried 50,000). With a larger bulk set of greater than 10,000 records, we get the error 'java.sql.SQLException: Prepared statement contains too many placeholders'.
But, with a smaller batch size we see a great improvement when using the 'rewriteBatchedStatements=true' parameter. We saw that we can load 15,800 rows in a second to TABLE1 which is the table that took us the most time to load."
Previously, they were able to load only 2250 rows per second, so they're seeing roughly a 7X performance improvement with this change alone!
I hope these recommendations help if you encounter similar situations in your environment.
Read Comments (4)
With Infobright 4.0 we delivered a number of unique “Big Data” features including Rough Query (link to video) and DomainExpert™ (link to video) to support querying large volumes of machine-generated data. At the same time we also introduced a companion product to Infobright Enterprise Edition, the Distributed Load Processor (DLP) with a Hadoop connector, to enable very fast data load and a simple method to extract data from a Hadoop cluster. The ultimate goal is to extract and load data into Infobright as quickly as possible to allow for great data growth and “queryability”. Customers can use these tools to load over 2 terabytes an hour into just one table. In a previous blog post found here (http://www.infobright.org/Blog/Entry/loading_data_in_infobright_on_amazon_ec2_using_dlp/), we illustrated how to install DLP on an Amazon EC2 instance. In this blog, we will discuss how to leverage our Hadoop connector in conjunction with DLP.
If you’re new to Hadoop, take a look at the Apache™ Hadoop project (http://hadoop.apache.org/). Hadoop provides a highly-scalable approach to crunching through insane amounts of data. The main use case for Hadoop is to deal with near-petabyte+ worth of data. Hadoop provides a great resource for storing *all* of your detailed data for great lengths of time. Plus it doesn’t require a schema. So if your data is semi-structured or unstructured, Hadoop is a great asset in your arsenal.
Where Hadoop tends to fail is with respect to fast analytic response. While there are technologies that sit atop Hadoop intended for analytics, they do require a level of overhead. If you’re looking for blazing ad-hoc query speed, you won’t necessarily find it with Hadoop and MapReduce. To that point, Infobright provides that capability. By structuring and moving over a subset of the most highly relevant or even aggregated data into Infobright, you can provide an easy and fast user experience. We are strong advocates of using the right tool for the right job, and the combination of Infobright and Hadoop is a good example. If you are interested, watch the video about LiveRail (link to video on Youtube), which explains how and why they are using both.
Jumping ahead to the specifics, I assume you have Hadoop installed and working properly. Plus, I assume you’ve installed DLP. (You can download an Enterprise Edition free trial along with a DLP trial at http://www.infobright.com/Products/Product-Demo/ if you don't have a DLP license). Java is also required, so ensure the latest JDK is available. To utilize our Hadoop connector, a few pre-requisites are required.
These steps were verified using:
Infobright Database: IB_4.0_r13151_13690 (iee-commercial)
Infobright DLP (this includes the Infobright Hadoop Connector): DLP 1.1.0 64Bit
Hadoop: hadoop-0.20.205
Operating System: CentOS 5
Once you execute the command, you should see your data in the database. If you receive any errors, please consult the documentation for error codes. The official Hadoop connector documentation can be downloaded on the support portal at infobright.com.
If you're wondering how Infobright started, here is a brief history of the company and the technology:
Infobright was founded in 2005 to leverage a mathematical approach, called Rough Set, to solve data management and analytic problems. Rough Set started with a Polish computer scientist Zdzislaw Pawlak in 1981 as a mathematical tool to deal with vague concepts. In his theory, Pawlak describes the lower and upper approximations of a set as crisp or conventional, defining the upper and lower boundaries. This approach is useful for rule induction from incomplete data sets, and in other variations, also helped to form approximating sets known as fuzzy sets. Dominik Slezak, one of Infobright's founders, explains this as "we follow the rough set approach to identify: (1) the data portions that are fully relevant to the given query execution; (2) the data portions that are fully irrelevant to the given query execution; (3) the data portions that remain undecided." This theory when applied to data mining and machine learning was quickly adopted by a variety of industries with different applications.
Infobright's founders realized that Rough Set is a powerful tool to enable fast queries against large data sets without doing all the database administration work that had always been a requirement in the past to achieve fast performance. Instead of requiring indexes, data partitioning and other typical techniques, intelligence in the software could drive performance. Infobright calls this intelligence the Knowledge Grid. Wedding the Knowledge Grid to a columnar database architecture produced a powerful solution that could handle large amounts of data fast and simply, at a very low overall TCO.
In 2006 Infobright formed a partnership with MySQL, taking advantage of the "storage engine" architecture that MySQL had to encourage other companies to create new databases for different use cases while taking advantage of many MySQL functions. This integration meant that migrating from a row-based MySQL database to the Infobright columnar-based database would be as simple as a command line change.
Infobright first introduced its' technology, then known as Brighthouse, at the 2007 Rough Set Conference in Toronto to a very positive reception. A few months later, in 2008 the company released the industryís first commercial open source analytic database software (infobright.org) and started building a strong and growing open source user community, with more than 15,000 downloads in the first year. Within a year, Infobright had more than 40 customers including ISV OEM customers who embedded Infobright in their own software offerings.
Since 2008 there have been a lot of other changes. For one thing, the product is now called Infobright not Brighthouse. There are tools to integrate with major BI partners such as Pentaho, Jaspersoft, Talend, Actuate and Informatica. Users can load data several different ways depending on their needs such as using the Infobright loader, the MySQL loader, Infobright's Distributed Load Processor, or other ETL tools. Customers have reached data load speeds of up to 200,000 records per second. Infobright positioned themselves as the leader in the open source data warehousing community, and soon after, for recognition of their outstanding contributions to the MySQL ecosystem, Infobright was awarded the prestigious MySQL Partner of the Year award by Sun Microsystems in April 2009.
Using built-in intelligence, Infobright's unique way of storing and analyzing machine-generated data has provided the vehicle to near real-time analytics in big data. Machine-generated data has become one of the fastest growing categories of big data, with sources ranging from web, telcom network and call-detail records, to data from online gaming, social networks, sensors, computer logs, satellites, financial transaction feeds and more. This focus on machine-generated data within big data, beginning in 2010, gave rise to the rapid increase in customer momentum. And our latest version 4.0, released last summer, included Hadoop connectivity, as well as the introduction of DomainExpert™ and Rough Query. Developed exclusively by Infobright, DomainExpert™ uses specific intelligence about machine-generated data to automatically optimize how data is stored and how queries are processed. Rough Query leverages our Knowledge Grid to deliver data mining drill down at RAM speed, otherwise known as "Investigative Analytics.".
From the beginning our executives and engineers recognized that the looming database challenge was how to analyze and extract actionable knowledge from very large (and growing) data sets. Clearly the market agrees as terms such as "Big Data" and "machine-generated data" become more commonly used. In addition, companies appreciate that Infobright's approach "to work smarter not harder" means that their users can get fast query response, even to ad hoc queries, without a high overhead of database administration or hardware costs.
By 2011 eight of the top ten telecommunications service providers worldwide are using Infobright to mine their big data. Hundreds of customers use Infobright daily and more than 100,000 users have downloaded both our community and enterprise editions. Infobright is still leading the industry and paving the way.
Craig Trombly, Infobright Community Manager