02
Apr

POSSCON 2012 - Great Success!

by Jeff Kibler     Mon, Apr 02, 2012

Last week, I attended and spoke at POSSCON 2012, one of the largest open-source software conferences on the East Coast.  Boasting nearly 700 attendees this year, the three-track set of presentations went over very well.  Coupled with the southern hospitality, the event was surely a crowd-pleaser.

In the technical track, I spoke about one of my favorite topics: the emerging database landscape.  I introduced the basics of row, column, NoSQL and even NewSQL technologies: where they fit, how they fit, and who needs them.  In total, we had about ten questions at the end of the talk.  Plus, we had representation from Oracle and MariaDB to help answer any vendor-specific questions.  At the conclusion of the talk, I provided the link to download our emerging database landscape whitepaper (you can get it here: http://www.infobright.com/land/emerging_database_landscape/).  Beyond my talk, other speakers discussed several really cool technologies.  For example, Twitter spoke about their extensive usage of open-source technologies.  It really does take a lot of open-source to power just 1 tweet.  It also explained why my tweets take a second or two to appear in my feed ;-)

As for one of the keynotes, Scott McNealy provided great context for the survival of open-source.  We need a champion, and the champion needs to be a power player.  He encouraged the community to find its new champion and herald its momentum.  I believe Scott is correct: we can only sustain the momentum with the power of a capable leader behind it.

The organizers of POSSCON should be extremely proud.  Todd Lewis did a magnificent job of working with vendors, patrons, and speakers, and his staff did a wonderful job of showing everyone a great time.  Honestly, this conference was far better than any other conference I have attended.  I really look forward to the opportunity to return next year.

For more information on POSSCON, please visit: http://www.posscon.org/


20
Mar

Couchbase NoSQL Database Technology Whitepaper - Drinking Too Much Kool-Aid?

by Jeff Kibler     Tue, Mar 20, 2012

Couchbase, the recently released, unique Key-Value Store/Doc Store hybrid, is a powerful tool within the NoSQL space. As with most other NoSQL data storage options, Couchbase thrives on schema-less design and scale-out architectures. They, along with Hadapt, Cassandra, and others, create good technologies that fill a very critical need within scalable architectures. Yet, I believe some of these technologies tend to drink too much of their own Kool-Aid. In a recent whitepaper, Couchbase articulates their strategic differences between relational databases and NoSQL technologies. Initially comparing to 1970s technology (see: AA's Sabre, BofA Automation System), they highlight the lack of scale in early RDBMS technologies. I agree completely -- technology from the 1970s, 1980s, and 1990s is probably not going to scale as well as Couchbase. Plus, I agree that system uptime requirements have ballooned considerably over the past forty years. However, that's where my agreement with Couchbase's assessment ends.

First, Couchbase seemingly ‘bucketizes’ all RDBMS technologies in one bucket. That bucketing means that MySQL, Infobright, Teradata, and VoltDB all belong to the same group; yet each of these technologies fits drastically different use cases. To ‘bucketize’ these systems into one group with blanket statements shows naiveté. Scale within some of these technologies reaches into petabytes.

Secondly, once bucketed, they suggest that RDBMS technologies have no place within the future of database and data storage technologies. On page 8, the whitepaper mentions that "changes like these [adding 1 column to a schema] are extremely disruptive and therefore frequently avoided." Untrue. Infobright can add columns quickly and efficiently. It appears the whitepaper assumes that all RDBMS technologies are transactional.

Thirdly, the whitepaper insinuates that optimizations found in RDBMS technologies are improper. Query sharding and data sharding are 'work-arounds', I concur. Yet, when applying appropriate business logic, the 'work-arounds' efficiently mitigate the most common of issues. Plus, many (newer) RDBMS technologies have high scalability. To suggest that current RDBMS technologies adhere to a 1970s mentality is incorrect.

Finally, the most incendiary notion is that RDBMS technologies are a "disease" ("Fight symptoms but not the disease itself"). In my opinion, to say it again, Couchbase is beginning to drink a little too much of its own Kool-Aid. I am always leery of any vendor who claims their product (or, in this case, product family) is the best (and only) answer to all problems.

Couchbase should recognize that RDBMS technologies do have a strategic place in enterprise data management. I can guarantee that Couchbase does not satisfy all or even most of the use cases within the big data world. In fact, I'd further argue that NoSQL as a product family cannot satisfy all of those use cases. I look at the explosion of RDBMS technologies within the transaction and analytic worlds; their success proves the point.

I have great respect for Couchbase; I was able to meet some of their team last year at ZendCon. I am eager to get a few moments to try out the technology; in fact, I'm downloading it now.  I expect great things from Couchbase, and I look forward to watching them boom.  At this point, the only recommendation I have for Couchbase is to avoid the marketing scare tactics and be honest about the pros and cons of each technology.

Link to Couchbase Whitepaper: http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQL-Whitepaper.pdf


13
Mar

Need Real World Experience?

by craigtrombly     Tue, Mar 13, 2012

Infobright has an established internship program at the University of Illinois at Urbana-Champaign and we work closely with The Career Center here. Our program offers Computer Science/Computer Engineering students and MBA students with CS/CE backgrounds an opportunity to gain real world experience. We offer students part-time paid work under the supervision of established professionals that will greatly increase their value in their future careers. The program enables Infobright to do creative projects we couldn't otherwise get done based on resources and priorities.

This program is designed to give students a chance to work on exciting new projects focusing directly on challenging their skill sets. Candidates who are chosen to participate will get a chance to contribute to existing internal projects and even direct Line of Business (LOB) applications. Our program is designed to help train and prepare candidates for professional positions in our industry. We look for candidates who are well-educated, can communicate effectively and harbor a passion for what they do.

Currently we are looking for interns for Application Development and Marketing Development. As an application developer, our interns will focus on supporting our open source community and developing sample applications using the Infobright database. These applications will be offered via the Infobright Community (http://www.infobright.org). This position is good for someone who wants to be a software developer, a software architect, or some related technical software role.

The marketing intern will assist our Director of Marketing in planning and executing outbound marketing campaigns. This person will learn the nuts and bolts of marketing and the use of marketing automation software. This position is good for someone who wants to work within marketing at a B to B (business to business) company within the high tech industry.

Many of our interns have gone on to work for major tech companies and consulting firms. We can say they started here, at Infobright!

Infobright     Tags: internship, uiuc

06
Mar

Thoughts on the PDP 2012 – GPUs… CPUs…

by Dominik Slezak     Tue, Mar 06, 2012

Hello,

Those of you who visit our forums surely know Janusz. If nobody has yet answered your question, Janusz will do so. Janusz is Infobright's architect. He has published a couple of scientific articles on various aspects of scalable computing. He still attends academic events from time to time and usually has very fresh, insightful comments about the presented material. Below you can find his analysis of the very recent conference on Parallel, Distributed and Network-Based Processing hosted by the Leibniz Supercomputing Centre near Munich in Germany.

Best regards,

Dominik

+++++++++++++++

Hi!

I've been to several conferences concerned with parallel computing. Some topics are omnipresent there, e.g. new and more efficient parallel implementations of some numerical algorithms or better tools for parallel programming and performance tuning. Some topics appear and disappear, e.g. using the Sony PS3 (the Cell Engine) for numerical computations. Some topics emerge and stay in focus. I would like to talk about two such topics, both very visible at the recent EuroMicro PDP2012 conference.

Power efficiency is often thought of in terms of laptops and other electronic devices and how long they can run on a battery. For parallel systems, think instead of installations composed of thousands of CPUs that use megawatts to stay online and additional megawatts for air-conditioning. Power efficiency has a different dimension here and is usually tied to hardware design, starting at the CPU level and continuing through boards and racks to airflow design in buildings. I took note of a novel approach: power control at the application level. For example, one can test several algorithms for solving a given problem and choose the best one, not in terms of execution time, but in terms of power efficiency. Or one can control the CPU speed within the application and even shut down unused CPUs. This gives an interesting result: the power-saving CPU modes (lowering clock frequencies) reduce the power draw in watts, but because a throttled CPU is slower, it takes more time to perform the given calculations, which can lead to higher energy consumption in joules (watts × time). GPUs (Graphics Processing Units) can be employed to do number crunching with less electric power if your application can offload some computations to the GPU board.
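To make the watts-versus-joules point concrete, here is a minimal sketch. The power and runtime figures are hypothetical, chosen only to illustrate the trade-off; they are not measurements from the conference.

```python
def energy_joules(power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed = average power draw (W) x runtime (s)."""
    return power_watts * runtime_seconds


# Hypothetical figures: the throttled CPU draws less power but runs longer.
full_speed = energy_joules(power_watts=100.0, runtime_seconds=60.0)
throttled = energy_joules(power_watts=70.0, runtime_seconds=100.0)

print(f"full speed: {full_speed:.0f} J, throttled: {throttled:.0f} J")
```

With these assumed numbers the throttled run uses more energy overall even though its power draw is lower, which is the effect described above. Whether that happens in practice depends on how much the runtime stretches.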

Yes, the GPU story. General-purpose computation on GPUs (GPGPU), that is, performing arbitrary computations on graphics cards, started several years ago. Someone noticed that graphics cards can compute much faster than CPUs: they contain hundreds of cores and, thanks to massive internal parallelism, can reach TFLOPS performance. If one can cast a computational problem as a graphics operation, then a graphics card will calculate it quickly. In time, programming environments were designed (CUDA, OpenCL) that let an average programmer write C code to be executed on GPUs. Boards with GPUs dedicated to computations, not graphics, were created (e.g. the Tesla series). And now look at top500.org, which lists the fastest supercomputers in the world: in the top 10, a few installations are powered by GPU accelerators. How come? Oh, because of gaming, of course. Modern 3D game graphics need a lot of computation and everyone wants to play the new games. Result: millions of clients purchasing better graphics cards all over the world. Graphics calculations can be efficiently implemented in silicon; hence GPU peak performance is about 100 times better than CPU performance. Additionally, the large market demand for GPUs leads to lower prices and frequent new models. With slightly enhanced functionality and better programming tools, we get a fairly cheap and very powerful computing device. However, computer graphics relies on linear algebra, and GPUs are fast at linear algebra-like problems. By that we mean regular data structures and operations decomposable into small homogeneous tasks, computable independently in parallel. So the art of GPGPU programming is to find linear algebra-like components in a given (non-linear-algebra) problem. Sessions on GPU programming have become omnipresent at parallel programming conferences, and this one was no exception.

Actually it is not so difficult to port a program onto a GPU card. The problem is to achieve a sizeable speedup over a CPU-only version. Despite new and better GPU-friendly programming environments, achieving good performance requires a lot of technical knowledge and low-level work. The typical points here are:

  • Global memory access coalescing
  • Asynchronous transfers
  • Homogeneous branching among threads in a single wave
  • Amount of computations to size of transferred data ratio
  • Tuning cache size versus shared memory within GPU

and other similar topics. So do not expect to get a GPU card and just compile your application for it to get a 10x speedup.
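The "amount of computation to size of transferred data" point can be sketched with a back-of-the-envelope model. This is a simplified illustration, not a profiling method: it assumes transfers and GPU work do not overlap (asynchronous transfers would change the picture), and every number below is hypothetical.

```python
def estimated_speedup(cpu_seconds: float,
                      gpu_seconds: float,
                      bytes_moved: float,
                      bus_bytes_per_second: float) -> float:
    """Speedup of GPU offload over CPU-only, counting data transfer time.

    Assumes the transfer and the GPU computation run one after the other.
    """
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return cpu_seconds / (transfer_seconds + gpu_seconds)


# Hypothetical case 1: lots of computation per byte moved.
heavy = estimated_speedup(cpu_seconds=10.0, gpu_seconds=0.5,
                          bytes_moved=4e9, bus_bytes_per_second=8e9)

# Hypothetical case 2: little computation per byte moved.
light = estimated_speedup(cpu_seconds=1.0, gpu_seconds=0.1,
                          bytes_moved=8e9, bus_bytes_per_second=8e9)

print(f"compute-heavy: {heavy:.1f}x, transfer-heavy: {light:.2f}x")
```

In the second case the GPU kernel itself is ten times faster than the CPU, yet the offload is a net loss because moving the data costs more than the computation saves.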

What about GPU usage for query processing? Do we have linear algebra-like problems within queries? It does not look that way. Even so, I am optimistic (though not blindly enthusiastic) about possible GPU usage in query processing. Let me give two examples.

As reported at the conference, someone compiled his application for a GPU board and got no speedup. Apparently his application was not linear algebra-like. So he looked at it more carefully and found that a sizable part of it, in some circumstances, could be made regular enough to execute efficiently on the GPU. Offloading the proper part of the computations to the GPU proved to be the right strategy: a large speedup was observed.

The second example is not related to the conference. I read about a database project done in Poland concerned with searching large sets of scientific papers. One could search for occurrences of specific names of chemicals, where a single name can appear in many forms. The algorithm matching documents with the names was computationally heavy, and it was implemented on a GPU. A substantial speedup was achieved, even before advanced technical optimizations were applied.

I think there is a place for GPUs in query processing. The key is to find a proper area, where sizeable computations on regular data structures must be performed. This may be connected to some specific usage scenario, specific query type or specific data. We just have to look carefully for it and assess whether the speedup is worth the effort (and GPU hardware).

Best regards,

Janusz


29
Feb

Analytics and Technology

by craigtrombly     Wed, Feb 29, 2012

Over the last few weeks, I have read about how different organizations are using analytics to solve particular business problems. I have been amazed with the elegance of these solutions and how they might be applied to solve other business issues.  One of my favorite pastimes is to read about how different organizations are utilizing these technologies within their infrastructure and how they make them fit within their workflow and business processes. 

Data analytics is complex and a topic that might often make the traditional OLTP database engineer want to re-schedule their day.  However, when an organization starts to ask questions like "How can we leverage analytics?" or "What can we learn from better data analysis?", the conversations start to become less and less complex and more straightforward.  Data intelligence, which is pure fact, combined with analysis can teach us how to operate even more intelligently.

In one use case by the Canadian Space Agency, I learned that analytics are being used to monitor measurements against strict tolerances, which if missed, could highlight manufacturing flaws in a satellite.  This analytic approach was intuitive and elegant.  Intelligence allows us to overcome the barriers of billions and billions of records and sub-second query response, without having to throw more hardware at a solution.  Elegant and intuitive, this is where analytic approaches to information and software development cross over.

With the need for access to more information, which typically spans increasingly large data sets, more and more companies are faced with the challenges of big data.  Delivering fast performance while reducing hardware through very high rates of data compression, and eliminating manual effort, is not an easy task, but this is where I have found that the scientific approach Infobright has taken excels.

In engineering we often train other people how to think when it comes to software development.  We tend to teach users our development process, and oftentimes we will shape the end user’s mindset about the approach to an application. There is a tendency to focus more on the technical aspects of the system and information flow, rather than the analysis of the data itself.  I know that I have said at least once or twice before, "That box can hold any information you want it to." However, with data analytics and the kinds of solutions that Infobright offers, I can ask "What kind of information can I provide that will make you better at your job?"

To read about the Canadian Space Agency and other use cases, please visit: http://www.infobright.com/Customer/canadian_space_agency/

 


23
Feb

Connecting Infobright and Talend

by Jeff Kibler     Thu, Feb 23, 2012

Fresh on the heels of Talend’s latest major release (v5), I noticed a surge in activity on our user forums regarding using Talend with Infobright.  To consolidate our recent lessons learned, I wanted to describe the most efficient method for connecting Talend to Infobright.  If you have any questions, please feel free to post to our user forum; we’ll be happy to tackle any issue you raise.

These instructions assume that you have Infobright installed and running.  

  1. First and foremost, download Talend.  In this example, we will download Talend Open Source Data Integrator v5.0 (http://www.talend.com/download.php).
  2. Once fully installed, download the Talend/Infobright Connector.  Ensure you download the right connector; instructions are on the download page (http://www.infobright.org/Downloads/Contributed-Software/)
    • If you downloaded Talend 4.0+, you’ll want the latest connector.
    • For older versions of Talend, you’ll want the 3.7 connector or lower.
  3. Once downloaded, perform the following actions:
    • [For Windows] Copy the infobright_jni([_32|_64])bit.dll to C:\Windows\infobright_jni.dll
    • Copy the zipped “tInfobrightOutput” directory to this directory: [Install Root of Talend] \plugins\org.talend.designer.components.localprovider_5.0.1.r74687\components\tInfobrightOutput
    • Copy “infobright-core-3.4.jar” to [Install Root of Talend]\lib\java
  4. Running Talend in Windows
    • If using Windows, run Talend as Administrator.  If you don’t, you will see odd “Access Denied” or “Accesse Refuse” error messages when trying to use the connector.
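The two non-Windows-specific copy steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name install_connector is hypothetical, the connector download is assumed to be already extracted into a folder containing the tInfobrightOutput directory and infobright-core-3.4.jar, and the plugin directory name is the one quoted in step 3, which will differ for other Talend builds. The Windows DLL step is left as a manual step.

```python
import shutil
from pathlib import Path

# Names taken from the steps above; the plugin directory is version-specific.
PLUGIN_DIR = "org.talend.designer.components.localprovider_5.0.1.r74687"
COMPONENT = "tInfobrightOutput"
CORE_JAR = "infobright-core-3.4.jar"


def install_connector(talend_root: Path, download_dir: Path) -> None:
    """Copy the extracted connector files into a Talend install tree.

    Hypothetical helper: download_dir is assumed to hold the unzipped
    tInfobrightOutput directory and the core jar side by side.
    """
    # Step 3, second bullet: the component directory.
    component_dst = talend_root / "plugins" / PLUGIN_DIR / "components" / COMPONENT
    shutil.copytree(download_dir / COMPONENT, component_dst, dirs_exist_ok=True)

    # Step 3, third bullet: the core jar.
    jar_dir = talend_root / "lib" / "java"
    jar_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(download_dir / CORE_JAR, jar_dir / CORE_JAR)
```

A call would look like install_connector(Path(r"C:\Talend"), Path(r"C:\Downloads\connector")), where both paths are placeholders for your own locations.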

Now, run Talend.  At this point, you should see the “tInfobrightOutput” component listed underneath Databases > MySQL.  When you use the component, you will want to specify which version of Infobright you’re using: IEE or ICE.

Infobright     Tags: talend

16
Feb

Implementing the RANK Function in Infobright

Infobright
by Damian McKillop     Thu, Feb 16, 2012

Within MySQL and Infobright, a rank function is sometimes requested. There are various ways to include a rank in your query, but we wanted to highlight one potential option.

The idea is to use what you know about the query to decide how the rank should be computed. For example, let's assume you have two main products: 'milk' and 'cookies'. These are great products, and you want to determine which salespeople (ID numbers: 1, 3, 4, 5, and 19) are the best sellers of your two amazing products.

We will make a few assumptions about your rank:

  1. Your Number 1 Sales Person (rank=1) is the person with the highest total sales!
  2. You want to group by products. Just because someone sells a million 'milks' doesn't mean they're better than someone who sells 30 'cookies'.
  3. Assume that salesman_ids are unique. We don't want to give double credit to our workers!
  4. This is not a dense_rank function. In other words, if the top three of four employees have the same value, the resulting ranks would be 1,1,1,4. A dense rank would return a result of 1,1,1,2. We will cover the dense_rank function in an upcoming blog post.


Here's how you'd compute a rank function in Infobright.

create database foo;
use foo;

create table foo.other_table (
salesman_id int,
product_name varchar(10),
total_sales int
);

insert into foo.other_table values (5, 'milk', 1000000);
insert into foo.other_table values (3, 'milk', 2000000);
insert into foo.other_table values (4, 'milk', 3000000);
insert into foo.other_table values (1, 'milk', 500);
insert into foo.other_table values (19, 'milk', 1500);
insert into foo.other_table values (5, 'cookies', 100);
insert into foo.other_table values (3, 'cookies', 20000);
insert into foo.other_table values (4, 'cookies', 30);
insert into foo.other_table values (1, 'cookies', 50);
insert into foo.other_table values (19, 'cookies', 150);

-- Rank = 1 + the number of rows for the same product with strictly higher sales.
-- Note: on MySQL 8.0 and later, rank is a reserved word and must be quoted with backticks.
select * from (
  select foo.salesman_id, foo.product_name, foo.total_sales,
    (select 1 + count(*)
     from other_table zy
     where zy.product_name = foo.product_name
       and zy.total_sales > foo.total_sales) rank
  from other_table as foo
) as xy order by xy.product_name, xy.rank;

+-------------+--------------+-------------+------+
| salesman_id | product_name | total_sales | rank |
+-------------+--------------+-------------+------+
| 3 | cookies | 20000 | 1 |
| 19 | cookies | 150 | 2 |
| 5 | cookies | 100 | 3 |
| 1 | cookies | 50 | 4 |
| 4 | cookies | 30 | 5 |
| 4 | milk | 3000000 | 1 |
| 3 | milk | 2000000 | 2 |
| 5 | milk | 1000000 | 3 |
| 19 | milk | 1500 | 4 |
| 1 | milk | 500 | 5 |
+-------------+--------------+-------------+------+
10 rows in set (0.01 sec)


Now, what happens when we have two workers with the same sales? Let's add salesman_id number 2, assign him/her the same sales as others in the company, and re-run the same query.

insert into foo.other_table values (2, 'milk', 1500); -- Same as Employee #19
insert into foo.other_table values (2, 'cookies', 30); -- Same as Employee #4

+-------------+--------------+-------------+------+
| salesman_id | product_name | total_sales | rank |
+-------------+--------------+-------------+------+
| 3 | cookies | 20000 | 1 |
| 19 | cookies | 150 | 2 |
| 5 | cookies | 100 | 3 |
| 1 | cookies | 50 | 4 |
| 4 | cookies | 30 | 5 |
| 2 | cookies | 30 | 5 |
| 4 | milk | 3000000 | 1 |
| 3 | milk | 2000000 | 2 |
| 5 | milk | 1000000 | 3 |
| 19 | milk | 1500 | 4 |
| 2 | milk | 1500 | 4 |
| 1 | milk | 500 | 6 |
+-------------+--------------+-------------+------+
12 rows in set (0.02 sec)
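As a rough illustration of how the dense rank mentioned in assumption 4 differs, the same correlated-subquery idea can count distinct higher totals instead of rows. The sketch below runs the SQL against an in-memory SQLite database through Python's standard sqlite3 module purely as a stand-in; it has not been run against Infobright, and the alias dense_rnk is an arbitrary name chosen for this example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; this is not Infobright.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table other_table ("
    "salesman_id int, product_name varchar(10), total_sales int)"
)
rows = [
    (5, "milk", 1000000), (3, "milk", 2000000), (4, "milk", 3000000),
    (1, "milk", 500), (19, "milk", 1500), (2, "milk", 1500),
    (5, "cookies", 100), (3, "cookies", 20000), (4, "cookies", 30),
    (1, "cookies", 50), (19, "cookies", 150), (2, "cookies", 30),
]
conn.executemany("insert into other_table values (?, ?, ?)", rows)

# Dense rank = 1 + number of DISTINCT higher sales totals for the same product,
# so tied rows do not leave a gap after them.
DENSE_RANK_SQL = """
select foo.salesman_id, foo.product_name, foo.total_sales,
       (select 1 + count(distinct zy.total_sales)
        from other_table zy
        where zy.product_name = foo.product_name
          and zy.total_sales > foo.total_sales) as dense_rnk
from other_table as foo
order by foo.product_name, dense_rnk, foo.salesman_id
"""

dense_ranks = {
    (salesman_id, product): rnk
    for salesman_id, product, _sales, rnk in conn.execute(DENSE_RANK_SQL)
}

for key, rnk in dense_ranks.items():
    print(key, rnk)
```

The only change from the rank query is count(distinct zy.total_sales) in place of count(*). With the tied data above, salesman 1 gets a dense rank of 5 for milk rather than the 6 shown in the table.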


If you have any other functions you'd like us to discuss, please contact us.

Infobright     Tags: function, rank

15
Feb

Infobright and Analytics

by craigtrombly     Wed, Feb 15, 2012

As a software engineer myself, I have a strong background working with database systems like SQL Server and MySQL, not just from building web applications, but also from an n-tier application architecture approach. I tend to use standard approaches to architecture and modeling, so like any engineer, I have come to test the limits and drawbacks of SQL Server and other OLTP databases. Building out the common business, data and presentation layers of any application brings the many challenges of keeping those layers separate and autonomous.

I have worked in many different industries ranging from healthcare and education to wholesale and retail, and one of the biggest problems that I always seem to encounter in my projects is ad-hoc querying in the reporting modules. How do you keep the database tuned for queries when you do not know what the query will be? SQL Server certainly has caching and even the ability to cache the actual query and the data separately, but when the CFO wants to look at common trends against the corporate database for spending with daily, weekly and monthly increments for sales engineers in the northeast regions...

You get the picture. Why a different approach?

Because traditional OLTP systems, like most general-purpose DBMSs, are not built around an analytical approach. They do not focus on the statistical data that is necessary for reporting. I am a strong supporter of data analytics and feel that if your organization is not using them, then you are not competing at the level you could be. Infobright solves this problem with an approach that is based on performance, scalability and lowering overall costs.

In the next few months, I plan to learn the underlying architecture of Infobright and discover new ways of integrating Infobright into suitable analytic applications. I will do in-depth analysis of Use Cases and I look forward to learning how organizations are using Infobright within their infrastructure. Please feel free to contact me at craig.trombly @ infobright.com with your experiences and use cases.

Infobright     Tags: analytics

14
Feb

Hadoop for President

by Jeff Kibler     Tue, Feb 14, 2012

Last Wednesday (Feb 8th), I had the pleasure of speaking at the LA/Web Mobile Software Entrepreneurs meetup talking about "Data Warehousing for Ambitious Young Startups." (http://bit.ly/AAx7y1) Led by William Belk, this group boasts over 1,700 members, and in this past meetup, over 200 entrepreneurs attended.

The meetup ran a bit long -- nearly 2.5 hours to get through all eight presentations. After a long day at the office or on a plane, most attendees probably enjoyed the social events more than the speakers. Still, they persevered, and we got through all eight topics. The most interesting aspects were not the presentations themselves; instead, there appeared to be undertones and subtleties that dug into the 'competition.'

Overwhelmingly, the conversation was about Hadoop. In fact, within the first sixty minutes, I had the distinct impression that Hadoop was the *only* way to go. Interestingly, the varied Hadoop and big data vendors began to show their muscle; one proclaimed the end of RDBMSs in general. I find that notion to be far-fetched; I would argue that you'll find specialized databases that are very good at their niche. The Oracle "one-size-fits-all" viewpoint is morphing. In my opinion, the data centers of the future will have several specific data storage/retrieval options within their warehouse.

Nonetheless, after all of the Hadoop love, I gave a shout-out to my RDBMS brethren. Whether you're MySQL, PostgreSQL, Infobright, Netezza, or someone else, relational databases perform quite well in many environments. Big data isn't just for behemoths; big data is for the guys who do more with less. Your server farm may handle 10 petabytes worth of data, but TODAY, your average company may hold 5 Terabytes worth of data. While I concede that data growth is massive (40% average per year), the math shows growth to just 27 Terabytes in the next five years. [Data Pulled from Geminare's "Navigating the Digital Haystack"].
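The 27-terabyte figure follows from compounding 40% annual growth over five years, starting from 5 terabytes. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def projected_size_tb(start_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: size after `years` at a fixed annual growth rate."""
    return start_tb * (1 + annual_growth) ** years


# Figures quoted above: 5 TB today, 40% average growth per year, five years.
size_in_five_years = projected_size_tb(start_tb=5.0, annual_growth=0.40, years=5)
print(f"{size_in_five_years:.1f} TB")
```

This comes out just under 27 TB, matching the figure quoted above.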

Infobright holds 50TB+ in just one node. In fact, one of our ICE users reports over 127 Terabytes in just one node. With the great compression and query capabilities, Infobright is a great choice when you fit into this paradigm:

  • Data is growing very quickly
  • You hardly (if ever) change the data with update statements
  • The data is structured or can be structured
  • You are running analytic queries on it

With those criteria, we could be a great fit for you. Plus, I've also just told you what we are *not* designed to handle. We're not OLTP. We're not NoSQL or NewSQL. Plus, we are proud to point you in the right direction when you *do not* fit our criteria.

After the talk, several attendees (at least a dozen) approached me to thank me for the candid remarks; they also wanted a copy of our “Emerging Database Landscape” whitepaper, which outlines the pros/cons of many current databases. While Infobright must sell to keep the lights on, we also hope to sell to the right customer (and not just to sell blindly). It's in everyone's best interest to buy their software for the right reasons. If anyone tells you they are the end-all, be-all: caveat emptor (even if they are Hadoop-based).


08
Feb

Got Analytics?

by craigtrombly     Wed, Feb 08, 2012

Hello, my name is Craig Trombly. First and foremost, I would like to thank all of you for welcoming me as the new Community Manager here at Infobright. I look forward to meeting, discussing and learning from each of you. Who am I? I am a proud father of two incredible daughters, Kaylah and Kyara. I am a software engineer/architect at heart; I truly love to build and learn new technologies. Building software and business applications has been a passion of mine for over 15 years. I also enjoy the outdoors, playing golf and I am an avid poker player.

In my professional life, I built a software company from the ground up and led a team of engineers and managers through a host of custom software projects for clients. I have worked with many different platforms and software needs across different industries, and look forward to bringing that experience to Infobright. As the new Community Manager, some of my key goals are to expand the reach of the community, foster greater interaction, represent the views of the community back to Infobright and to improve the user experience with our software, our website, our documentation and more. Through outreach, collaboration and an analytical approach combined with years of hands-on experience, I intend to achieve these results. I encourage you to please reach out to me and discuss your ideas and concepts that may benefit my process. I can be reached at craig dot trombly at infobright dot com.
