It has been a little while since I have had the time to sit and write again, and I can admit that I quite miss it. From a community standpoint, the intern team and I have been really busy getting the new websites in place. We also have plans for some really awesome projects with other open source companies like Akiban, as well as some social media contests. Recently, though, I have been handling a fair amount of requests dealing with common issues, ranging from simple connections to permission problems. As a general rule, there are a few common steps that help isolate where the issue lies.
One of the most frequent issues arises because a default installation of Infobright only allows the root user access to the instance from the machine itself. Granting remote access is a very common request, though it is not suggested for the root user: as a matter of security, root access should only be allowed from the machine itself. A sketch of granting remote access to a regular user is shown below. Stay tuned next week for some more exciting news concerning the Infobright Community....
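As referenced above, here is a minimal sketch of the grant itself, assuming Infobright's MySQL-compatible GRANT syntax; the database name, user, subnet, and password are hypothetical:

/** allow the hypothetical user 'analyst' to connect from any host on the 192.168.1.x subnet **/
grant all privileges on reports.* to 'analyst'@'192.168.1.%' identified by 'some_password';
flush privileges;

Restricting the grant to a specific host or subnet, rather than the '%' wildcard, keeps the spirit of the locked-down default.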
An infographic is a good way to explain something visually. As many of you know, we do a number of things to help people make sense of the crazy database world. Here is an infographic on Big Data that helps explain row vs. columnar vs. NoSQL. Be sure to click the right arrows to see more info. Take a look!
If you have been following Infobright for any period of time, you know that we are big proponents of using the right database for the right use case. That's why we spend time and effort helping educate our prospects and community members about different technologies and the use cases they are best suited for. Recently Jeff Kibler, our SE extraordinaire and former community manager, downloaded and tested MongoDB and has written a new white paper about using MongoDB with Infobright. You can download it from our white paper page: http://support.infobright.com/Support/Resource-Library/Whitepapers/ (scroll down a bit to the Infobright Technical White Papers and you'll see it).
If you haven't already read the white paper "The Emerging Database Landscape," which discusses row, columnar, and NoSQL databases, you can download it here: http://www.infobright.com/land/emerging_database_landscape/
I just recently returned from the O'Reilly Open Source Conference in Portland, where Infobright was exhibiting. For those of you who have not had a chance to visit Portland, it is a great location for this conference. As you walk into the Convention Center, you do not realize how big the property is until you actually find the hall where the conference is held. There are huge sprawling hallways that keep opening up, and just when you think you have found the end, there is more. Walking into the hall where the exhibitors were located, there seemed to be a real sense of excitement from everyone involved, not a sense of "this is something we have to do," but more like something everyone was looking forward to. This seemed to be true for the attendees as well.
The exhibitor list ranged from all of the big companies I expected to see, such as Oracle, Intel, and O'Reilly, to quite a few of the up-and-comers in open source, like Akiban. One of the bigger themes seemed to be the Cloud, and it was clearly attracting a lot of interest. What I found most interesting was that the general intent of the attendees seemed to be education. A majority of them showed a real interest in learning about new technologies, which for me meant figuring out "How do I earn their attention?" If you are reading this, chances are you already know who Infobright is and what we do, and that we excel at it. But to continue to grow a strong community, we have to keep pushing and expanding our base. This was the approach I took, and it worked well.
The majority of the attendees I spoke with were unaware of Infobright and how we play a major role in the big data industry. I was able to attract many of them by discussing the core of our technology and how that can be used to solve some of the issues they face within their own businesses.
The conference itself was a hodge-podge of various technologies, which helped keep attendees on their toes between the many different speaker sessions available. They were eager to learn about new technologies, but it was a constant reminder to me that even though the big data industry is large, a general knowledge and understanding of it mostly resides with people already inside the industry. There were numerous engineers and architects who were already familiar with Infobright, but the vast majority were not. One of the individuals I spoke with who was already familiar with us said, "Infobright, yeah, tested it, liked it, performed great."
The weekly technical webinars Infobright puts on align with the areas of interest I saw at OSCON: helping educate people on the vast array of emerging technologies for dealing with big data. Education was why so many people came to OSCON. Don't make that a once-a-year proposition; continue to seek out new sources of learning.
So, for Infobright, the conference was a success and left a good impression on me for the next one. Oh, but you should definitely check out Intel's new Open Source Technology Center at 01.org.
I was recently asked this question, "How do you define big data?" My response was "I say big data defines you."
With that in mind, let me expand on my definition. To do this, I would like to digress and say that big data is, by definition, exactly that: a big amount of data. (Current industry discussions add Velocity and Variety, and sometimes Value, i.e., how fast the data needs to be analyzed in order to provide value.) Data is by definition information in some format or structure, so "big" must stand for the amount. Yet my definition of big data goes a bit past this. The big data industries that have formed over the last decade focus on many different aspects of it: some deal with the infrastructure to house and manage it, some deal with the software, some are in the analytics and business intelligence world, some are the consultants, and numerous companies focus on marketing, education and conferences on the subject. Within each segment of the industry, definitions may vary a little, but there are still some very key attributes that remain the same.
One, there are vast amounts of data which present unique challenges. Two, above and beyond the tasks of storing and managing it, actually working with and analyzing this data is a primary goal (otherwise, why store it?). Three, we look for insights that we can extract from this information to enable us to make decisions that will define the way we act upon it. For instance, a police chief in Pennsylvania knew that by analyzing criminal behavior and the patterns within it, he would be able to reduce crime in a geographic area by putting a higher police presence in place at specific times. The data and information that he extracted defined his response. Whether someone is researching a large number of publications and extracting relevant topics from specific articles, or gathering sales information for yearly trends during the rise and fall of economic windfalls, the process is still the same: we store, we search, we analyze, we define.
As a programmer myself, I have dealt with the various aspects of the infrastructure surrounding data. It is the service- and data-oriented interfaces that provide the vehicle for delivering information in a meaningful way. I have always focused on ensuring that the end user was able to get to the data they were looking for; otherwise the data is almost useless. Data's reason for existing is solely the value of the information and what can be extracted from it. One of the most exciting aspects of this industry for me has been reading about the different use cases for our analytic database and the very unique approaches that different organizations took to solving complex problems in simple, elegant steps. Browse to http://www.infobright.com and read about how the Canadian Space Agency is using Infobright to store and read their machine-generated data, or read any one of the JDSU or telecom-related white papers.
At the end of the day, we define our actions by what we learn.
Last week Infobright, Pentaho and Semphonic announced a joint initiative to help organizations derive business advantage from their detailed website data. For companies that depend on their websites to drive revenue (whether that be lead generation, ecommerce, or other activities), our goal is twofold:
As Web analytics has been around almost as long as the Web itself, you may be wondering why you need something more than what you probably have today. To paraphrase from the white paper “Everything You Know about Digital Measurement is Wrong… and How to Get It Right,” the problem is that all of the traditional Web analytics metrics are meaningless in terms of marketing and targeting. Segmentation in the digital age is more complex than with brick-and-mortar stores, and it requires both a new way of modeling the data and a new way of presenting it. It needs to answer not just Who? but Why?
The work that Infobright, Pentaho and Semphonic have done is a great start towards making it easy to implement a system that provides the answers your marketing teams need. To learn more, download the white paper, register for the July 19 webinar, or try out the demo for yourself. Become an IT star to your marketing team!
I recently had the experience of working with a few datasets that were quite sizeable for my laptop to manage. And while I found the challenges similar to those many of you have already experienced, I did identify a few steps that make things a little easier. The most important are process and preparation. I found that the more I knew about what I was going to need to do, the smoother each step of the general process was.
Most of the common challenges people are familiar with surround getting the data prepared for an analysis environment or tool. Usually common ETL tools are used to do this, but what if you are preparing data for the first time and you are not sure what that process is? Outlining your requirements for each step of the process is key to avoiding common pitfalls:
* What is the data?
* How large is it?
* Is it compressed?
* What are the staging environment storage requirements?
* What tools will be used for analysis?
* How will the data be prepared?
* How will it be moved around?
* How will it be put into production?
In my scenario, I had a single file of Web log data that was 47GB in size. I knew that with a file this size, no matter what decisions I made or outlined, each stage was going to take some time given the hardware resources I had (the aforementioned laptop). I did not have enough disk space to extract it and load it into a database on my computer, and I would expect many people to hit the same issue when working with files this size. So the next step was to rar the file, which took about an hour, and SCP it over to a local server in my network. Using an SSH session, I was able to unrar the file, connect to an Infobright command prompt, and start loading the data into a staging table.
It ultimately came to 157 million records. Where I ran into problems was with the format of the file's line terminators. Because the file came from a Windows machine and I had moved it down and rarred it onto a Linux box, the line terminators were different, but I did not find that out until 30 minutes into the first load data exercise. I made the correction and then, 38 minutes later, I had a count of 157 million records. Something to note: I did use the reject file parameters for the data load and was actually surprised that it encountered no data errors during the load.
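For illustration, here is a minimal sketch of what the corrected load looked like; the file path, table name, and delimiter are hypothetical, but the key change was declaring the Windows-style line terminators explicitly:

set @BH_REJECT_FILE_PATH = '/tmp/weblog_reject';
set @BH_ABORT_ON_COUNT = -1;
/** the file came from Windows, so declare the \r\n line endings explicitly **/
load data infile '/tmp/weblog.csv' into table weblog_staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\r\n';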
Now I was able to use Toad and take a look at the actual data and sample it accordingly. This allowed me to decide what data points I could optimize and what points were important. I had to look into making this data more optimal for the analysis environment so it could be queried in a timely fashion. I made a few DDL changes with the table structure and was able to test the response time to my satisfaction.
Now, since this was going to a production machine, I had to dump the data to a file, rar it up, send it to the server, and load the data again. I do remember thinking to myself that this was quite an exercise. By the end of the day, I had the data in a production database and quite well optimized. Looking back at what it took to accomplish this, I learned the following (a sketch of the export step appears after the list):
1. Always flowchart each step and document it
2. Be patient, but know as much as you can about the data itself
3. Know your environment(s) implicitly
4. Be prepared for something to fail, and know how to identify where the breakdown is
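On the export step mentioned above, here is a minimal sketch of dumping the staged table back out to a flat file before compressing and shipping it to production; the table and path names are hypothetical:

/** export the optimized staging table to a flat file for transfer to the production server **/
select *
from weblog_staging
into outfile '/tmp/weblog_export.csv'
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n';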
The key point I learned is process and preparation. Whether you are working with 47GB of data or 47TB, the point is still valid. Stay tuned and I'll let you know what I am doing with this data...
These days I am visiting an amazing place: King Abdullah University of Science and Technology (KAUST, http://www.kaust.edu.sa/). It was established quite recently to strengthen research and academic activities in Saudi Arabia. It shows that, fortunately, there are still people who understand that an investment in science is the best kind of investment that one can imagine.
I am sure I will have an occasion to write more about KAUST in the next couple of days. For now, let me explain the reason why I am here. Namely, to attend the workshop on Three Approaches to Data Mining (http://trees.kaust.edu.sa/Pages/NewsWorkshop-8-06-2012.aspx) organized by the group of Dr. Mikhail Moshkov (http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/m/Moshkov:Mikhail_Ju=.html). The workshop is a continuation of an excellent book on Test Theory, Rough Sets and Logical Data Analysis (http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-642-28666-7). The book compares three approaches to data analysis by focusing on similarities between their goals, foundations and applications.
Let me share with you a couple of citations:
These three approaches have much in common. For example, they are all related to Boolean functions and Boolean reasoning with the roots in works by George Boole. However, we quite often observe that researchers active in one of these areas have a limited knowledge about the results and methods developed in the other two. All three data analysis approaches use decision tables for data representation. A decision table T is a rectangular table with n columns labeled with conditional attributes. This table is filled with values of (conditional) attributes and each row of the table is labeled by a value of the decision attribute d. There are different problems of data and different problems associated with them. A typical problem in all three approaches is the problem of revealing functional dependencies between conditional attributes and the decision attribute. However, all three approaches use different terminology related to this problem.
Indeed, when reading the book and listening to the workshop talks (particularly those introductory ones, such as the one by Dr. Endre Boros (http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/b/Boros:Endre.html)), I can see a number of commonalities. In the book, the authors actually made an effort to identify such commonalities. (For instance, the notion of a reduct, known from Rough Sets as a minimal subset of conditional attributes determining decision, is comparable to a dead-end test in Test Theory. In the same way, the notion of a super-reduct, where some attributes are allowed to be redundant while determining decision, can be compared to a support set in Logical Data Analysis).
This reminds me of another book published this year – Rough Sets: Selected Methods and Applications in Management and Engineering (http://rd.springer.com/book/10.1007/978-1-4471-2760-4/page/1) – where we attempted to gather some practical examples of how Rough Sets can be combined with other approaches in order to efficiently solve real-world data analysis projects. In the case of both books, the crucial aspect is to understand how different methodologies can complement each other at the level of foundations and applications.
Of course, as human beings, we are often tempted to say that our approach or product is the best one. I have seen such an attitude many times, both in business and in science. (Maybe even more often in science...) However, we need to remember that the most valuable solutions usually emerge at the edges of different areas. I think that this is also part of the vision of the founders of KAUST. And this is also why I’m going to listen to the rest of the workshop presentations very carefully…
While using command-line interfaces is the status quo for *nix lovers, I acknowledge that front-end BI tools are much prettier and more user-friendly than the command line. Among the vast majority of our community and enterprise users, we see the traditional data pipeline consisting of data source, ETL, Infobright, and BI tool. From time to time we’ll see Hadoop or another scale-out storage solution in front of Infobright, but for the most part, the setup is fairly simple.
Traditionally, though, BI tools cater to the original RDBMS vendors. In the original world of the RDBMS, the debate revolved around how much normalization should occur within your schema. Star and snowflake schemas for the traditional RDBMS address common problems with a flattened schema. However, Infobright is different. Due to the Knowledge Grid architecture coupled with industry-leading compression, we actually prefer a “de-normalized” or flattened schema structure. Unfortunately, this preference can cause a couple of headaches with BI tool SQL generators.
Since most BI tool SQL generators assume indexing, as opposed to our Knowledge Grid architecture, the generated queries favor join-heavy operations. In addition, they sometimes force a “select *”, a direct no-no when using a columnar database. The greatest benefit of a columnar database is avoiding disk I/O by reading only the columns a query actually needs; by choosing “select *”, you run the risk of losing that benefit.
When using BI tools from companies such as Pentaho, Jaspersoft, Actuate, MicroStrategy, Tableau and others, review the generated queries. If your data is already flattened, the generated queries may already be fine. However, if you still use star schemas, consider altering the queries to improve performance. For example, replace joins with sub-selects where possible. Also, ensure you group by numeric columns as opposed to varchars; some BI tools lose that human insight and group by both a varchar and its numeric counterpart (e.g., Employee Name and Employee ID). A sketch of both changes is shown below.
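As a rough illustration, with hypothetical table and column names: the first query is the kind a BI tool might generate against a star schema, and the second is a reworked version that selects only the needed columns, swaps the join for a sub-select, and groups by the numeric key:

/** typical generated query: join-heavy, grouping by both the name and the id **/
select e.emp_name, e.emp_id, sum(f.amount)
from fact_sales f join dim_employee e on f.emp_id = e.emp_id
where e.region = 'West'
group by e.emp_name, e.emp_id;

/** reworked for Infobright: sub-select instead of a join, grouping by the numeric column only **/
select emp_id, sum(amount)
from fact_sales
where emp_id in (select emp_id from dim_employee where region = 'West')
group by emp_id;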
BI tools are here to stay, and they really help make visualization of analytics easy. When working with Infobright, always take an extra second to review the generated queries. The extra few seconds could mean seconds or minutes in saved query times.
Infobright is committed to making new software releases and features more usable and friendlier. A notable feature in the current version (4.0.6) is the ability to use a reject file during data loads. Previously, if there was a row-level error, the entire process would return a failure and not commit. The new feature gives you more control over row-level data errors during the load process by allowing you to set a few options beforehand.
Let's take a quick look at how to use this feature as described in the User's Guide:
There are primarily two different ways to use this feature. The first way is to set the path for the file using the @BH_REJECT_FILE_PATH variable together with @BH_ABORT_ON_COUNT:
/** when the number of rows rejected reaches 12 (like the Cubs' recent losing streak), abort the process **/
set @BH_REJECT_FILE_PATH = '/tmp/reject_file';
set @BH_ABORT_ON_COUNT = 12;
load data infile 'DATAFILE.csv' into table T;
To tell the loader never to abort, simply set @BH_ABORT_ON_COUNT to -1. To use a percentage of row errors instead of a count, use the @BH_ABORT_ON_THRESHOLD variable instead:
/** if 3% of the number of rows error, then abort the commit process **/
set @BH_REJECT_FILE_PATH = '/tmp/reject_file';
set @BH_ABORT_ON_THRESHOLD = 0.03;
load data infile 'DATAFILE.csv' into table T;
Then to turn this feature off:
/** Disable the reject file feature **/
set @BH_REJECT_FILE_PATH = NULL;
set @BH_ABORT_ON_COUNT = NULL;
Something to note during the data loading process: please be aware of the difference between empty values and null values. The only way to load a null value during import is to specify the word NULL in all caps or '\N'. This differs from a space ' ' or even '' or "". You should always use quote enclosure on the columns in CSV files to ensure that embedded commas will not impede your data load. Using TAB ('\t') delimiters is often less troublesome in your data files; this is just my preference, but be aware of your settings. What happens to a data row that errors? That entire row is passed, as is, directly into the reject file. This enables you to correct the issues in the reject file and then LOAD that file once each row is corrected. Ninety percent of the time, visually inspecting a data row will let someone determine what the problem is, especially when the process is repetitive. Features like this enable users to be more efficient and greatly enhance the user's experience.
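To make the distinction concrete, here is a minimal, hypothetical example; the table name, file path, and sample rows are illustrative only:

set @BH_REJECT_FILE_PATH = '/tmp/reject_file';
/** quoted fields protect embedded commas; \N (or the word NULL) loads as a true NULL, while "" loads as an empty string **/
load data infile '/tmp/customers.csv' into table customers
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n';
/** sample rows in customers.csv:
101,"Smith, John",\N
102,"Jones, Mary","" **/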
Just another example of working smarter, not harder.