Fast Data Load and comparison
Posted: 19 January 2009 09:02 AM
Newbie

Dear Developers-Users,

First of all, I would like to thank you all for your efforts to bring us these DB technologies and a platform for open discussion.

We consult for large customers and deploy their data search technologies.

The total data size is around 1 petabyte daily (summed across all of our customers).

Per-customer data size ranges from 1 TB (9,000,000,000 lines of CSV) to 100 TB, so our customer base varies widely in size.

We have been looking at open-source DB solutions for a while, due to the lack of support and the technological cumbersomeness of the current commercial solutions.

We have 3 test customers now:

1 - 1 TB of data daily, for read-only purposes. A CSV file of 9,000,000,000 lines should be loaded into the database in at most 2 hours, followed by complex CRM analytic queries with at most 10 concurrent users/queries. The database is reloaded daily and will not grow from day to day, so the limit is a 1 TB data load.

2 - 10 TB of data daily. Same as above.

3 - 50 TB of data, stable. It will not grow.


So we are looking for a solution:

*Fast data loading. How fast can we load data into Infobright? Is multithreaded data loading possible?

*Is MPP possible? Can we add more servers to share CPU resources and decrease the load time?

What is the best HW solution for the first customer above? Do we still need fat enterprise servers, or what?


We are also looking at MonetDB, Vertica and LucidDB. Vertica is not open source, so it is off the list.

MonetDB seems very fast but lacks admin tools and docs.


So can we do this with Infobright Community Edition?

Also, is there a Mac OS X source for compiling?

Also, I couldn't find 32-bit Linux source code on the site.

Best Regards

W.S

Posted: 19 January 2009 03:08 PM   [ # 1 ]
Administrator

Hi,

I just wanted to drop you a line to let you know that someone will be getting back to you shortly. We haven’t forgotten you!

Regards,
Mark

Posted: 19 January 2009 04:01 PM   [ # 2 ]
Sr. Member

Prometheus,

Thanks for joining the Infobright community and posting the details of your opportunity.  Welcome!  I’m happy to respond to your questions and will do so inline with your original message.

*Fast data loading. How fast can we load data into Infobright? Is multithreaded data loading possible?

Infobright has the highest load speed per data server and it scales very well with multiple, parallel load processes.  The following values are for a single SMP server, therefore my reference to “per data server”.  For example, we have seen load speeds of 100GB/hour for a single table load process.  When we increased the number of tables being loaded to two, we saw two individual throughput speeds of 90GB/hour/table, or 180GB/hour.  With three tables and load processes, 80GB/hour/table, or 240GB/hour.  Likewise at 4 concurrent load processes, 70GB/hour/table or 280GB/hour at the aggregate level.

This test was performed with the Infobright Enterprise Edition which is multi-threaded.  The Community Edition, or ICE, is single-threaded so performance levels vary depending on table structures and the actual data, but are somewhere in the neighborhood of 50% to 60% of the multi-threaded Infobright Loader.
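The arithmetic behind these figures can be sketched in a few lines of Python. The per-table rates and the 50% to 60% ICE range are the ones quoted above; the names used here are illustrative only, and the ICE numbers are a rough estimate rather than a measurement.

```python
# Per-table load rates (GB/hour) quoted above for the multi-threaded
# Enterprise Edition, keyed by the number of concurrent load processes.
PER_TABLE_RATE_GB_PER_HOUR = {1: 100, 2: 90, 3: 80, 4: 70}

# ICE is described as reaching roughly 50% to 60% of those speeds.
ICE_PERCENT_RANGE = (50, 60)


def aggregate_rate(processes: int) -> int:
    """Aggregate throughput = per-table rate x number of parallel loads."""
    return PER_TABLE_RATE_GB_PER_HOUR[processes] * processes


def ice_single_table_estimate() -> tuple:
    """Rough ICE range for one table, scaled from the single-table rate."""
    base = PER_TABLE_RATE_GB_PER_HOUR[1]
    low, high = ICE_PERCENT_RANGE
    return (base * low / 100, base * high / 100)


for n in sorted(PER_TABLE_RATE_GB_PER_HOUR):
    print(f"{n} load process(es): {aggregate_rate(n)} GB/hour aggregate")
print("ICE, single table (estimate, GB/hour):", ice_single_table_estimate())
```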

*Is MPP possible? Can we add more servers to share CPU resources and decrease the load time?

Infobright is built on a single, SMP architecture.  Think of this as a “cluster of one” with a shared disk MPP architecture.  Resources can be added to the single server - such as CPUs and memory - to increase all aspects of performance.  We are currently in design and development of a multi-server version of this architecture that will remain a shared disk MPP architecture.  This will allow the incremental expansion of “front-end” resources for user connections and compute power as you suggest.  It also facilitates scaling the computing capacity separately from the storage capacity, as opposed to a shared-nothing MPP architecture.  The timeline for delivery of this architecture is still being evaluated.

What is the best HW solution for the first customer above? Do we still need fat enterprise servers, or what?

Infobright is agnostic on specific hardware platforms.  Infobright supports both Intel and AMD 32- and 64-bit chipsets and multiple varieties of Linux.

For your first customer, let me make sure I understand the basics:  They will load 1TB once daily and then run analytical queries against it.  On subsequent dates, the previous data will be dropped (or otherwise unloaded from Infobright) and the next day’s data will be loaded.  The storage and query requirements are all very well-aligned with ICE’s capabilities.  I’m sure you will see this in your experience.

The critical component here would seem to be loading 1TB in no more than 2 hours, or 500GB/hour.  As you have probably seen in the previous response on loading, the “scale factor” is the number of parallel load processes that are occurring at any time and the aggregate throughput.  (I would first want to make sure the server could “deliver”, or read, the source data at that rate.)
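To put that target in terms of what the source disks and the loader would have to sustain, here is a quick Python calculation. It uses the 1TB, 9,000,000,000-line and 2-hour figures from this thread and assumes decimal units (1 TB = 1000 GB), which is an assumption on my part.

```python
# Figures from this thread: 1 TB of CSV (about 9,000,000,000 lines)
# must load within 2 hours. Decimal units (1 TB = 1000 GB) are assumed.
DATA_GB = 1000
ROWS = 9_000_000_000
WINDOW_HOURS = 2

required_gb_per_hour = DATA_GB / WINDOW_HOURS
required_mb_per_second = required_gb_per_hour * 1000 / 3600
required_rows_per_second = ROWS / (WINDOW_HOURS * 3600)

print(f"{required_gb_per_hour:.0f} GB/hour")
print(f"{required_mb_per_second:.1f} MB/s sustained read from the source")
print(f"{required_rows_per_second:,.0f} rows/s")
```

The MB/s figure is the one to compare against what the storage holding the CSV files can actually deliver.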

We haven’t tested beyond 4 parallel loads on a dual CPU/dual core processor (or 1 core/load process) but if I extrapolate the roughly 10% drop in per-table throughput seen with each added load process, I would expect to see throughput approaching 400GB/hour with 8 parallel load processes.  Is the database design such that 8 parallel loads could be executed?  In this case, I would recommend a minimum of 8 CPU cores and no less than 32GB of RAM.  More of both would be beneficial.
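The post does not say exactly how the extrapolation was done, so the following Python sketch shows two possible readings, both of which are assumptions: a compounded 10% drop per added process, and a fixed 10GB/hour drop per added process (which reproduces the measured 100/90/80/70 series exactly). The function names are made up for this illustration.

```python
def compounded_estimate(processes: int, base: float = 100.0, drop: float = 0.10) -> float:
    """Aggregate GB/hour if each added process cuts the per-table rate by `drop`."""
    per_table = base * (1.0 - drop) ** (processes - 1)
    return per_table * processes


def linear_estimate(processes: int, base: float = 100.0, step: float = 10.0) -> float:
    """Aggregate GB/hour if each added process removes a fixed `step` per table."""
    per_table = max(base - step * (processes - 1), 0.0)
    return per_table * processes


for n in (4, 8):
    print(
        f"{n} processes: compounded {compounded_estimate(n):.1f} GB/hour, "
        f"linear {linear_estimate(n):.1f} GB/hour"
    )
```

Under the compounded reading, 8 processes come out close to the "approaching 400GB/hour" figure; the linear reading is noticeably more pessimistic. Neither is a substitute for an actual test with 8 parallel loads.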

As always, your results will depend on the data model, the data, achieved compression ratios and external factors such as how fast the source data can be “fed” to the Infobright Loader and what other loads are on the server at that time.

So can we do this with Infobright Community Edition?

From the information you’ve provided, I think Infobright would be a very good fit for your proposed projects.  Please give it a test and let us know how it works for you!

Also is there a Mac OS X source for compiling?

Infobright has not yet developed a Mac OS X port, but the 64-bit source code is available for testing compilation on Intel-based Macs.  Again, if you attempt this, please let us know your results.

Also, I couldn’t find 32-bit Linux source code on the site.

We are looking into this right now.  Please check back for an authoritative response from our Community VP.

Best wishes!

[ Edited: 19 January 2009 07:01 PM by David Lutz]
Posted: 19 January 2009 05:18 PM   [ # 3 ]
Administrator

Prometheus,

I looked into your question about 32-bit source for ICE.

The code base is the same for 32-bit and 64-bit – the choice of binary is configured during the build.

The instructions for building either 32-bit or 64-bit are supplied with the sources. If you look at the README (distributed with the sources):

Note: You can customize /etc/my-ib.cnf file by changing port, socket etc.
    If you are compiling on a 32 bit system, you need to rename brighthouse.ini.32bit
    (can be found in package dir /usr/local/infobright/share/mysqletc.) to brighthouse.ini
    The existing brighthouse.ini is configured with higher memory settings suitable
    for a 64 bit system
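For anyone scripting this step, here is a minimal Python sketch of the swap the README describes. It copies rather than renames, so the original 64-bit settings are kept; the backup name `brighthouse.ini.64bit.bak` and the function name are my own inventions, and the directory is passed in as a parameter rather than hard-coded.

```python
import shutil
from pathlib import Path


def use_32bit_config(config_dir: str) -> Path:
    """Put brighthouse.ini.32bit in place as brighthouse.ini, keeping a backup.

    `config_dir` is whatever directory holds the two files in your package.
    The backup file name is hypothetical, chosen only for this sketch.
    """
    directory = Path(config_dir)
    source = directory / "brighthouse.ini.32bit"
    target = directory / "brighthouse.ini"
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")
    if target.exists():
        # Keep the higher-memory 64-bit settings in case they are needed again.
        shutil.copy2(target, directory / "brighthouse.ini.64bit.bak")
    shutil.copy2(source, target)
    return target
```

Call it with the directory that contains both ini files in your own installation.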

Hope this helps,
Mark

Posted: 19 January 2009 06:41 PM   [ # 4 ]
Newbie

Thank you for your fast reply and kind answers.  I will let you know the results soon.

Posted: 20 January 2009 07:39 AM   [ # 5 ]
Super Duper Member

Hi !

Some comments on load speed are in http://www.infobright.org/Forums/viewthread/375/

Posted: 20 January 2009 08:40 AM   [ # 6 ]
Newbie

Installation problem:

Press Y -I agree, Any other key -I do not agree [Y/*]:y
Creating user mysql… /usr/sbin/useradd mysql
Creating mailbox file: File exists
useradd: warning: the home directory already exists.
Not copying any file from skel directory into it.
User mysql is created.
Installing default databases…
Installing all prepared tables
090120 13:40:38 [ERROR] Brighthouse: Can not access folder /root/download.
090120 13:40:38 [ERROR] Plugin ‘BRIGHTHOUSE’ init function returned error.
090120 13:40:38 [ERROR] Plugin ‘BRIGHTHOUSE’ registration as a STORAGE ENGINE failed.
090120 13:40:38 [ERROR] Failed to init plugins.
090120 13:40:38 [ERROR] Aborting

090120 13:40:38 [Note] /root/download/infobright-3.0-i686/bin/mysqld: Shutdown complete

Installation of system tables failed!

Examine the logs in /root/data/ for more information.
Failed on: scripts/mysql_install_db --defaults-file=/etc/my-ib.cnf --user=mysql --basedir=/root/download/infobright-3.0-i686 --datadir=/root/data/ --cachedir=/root/download/infobright-3.0-i686/cache/
Rolling back the installation due to unexpected failure…
Installation failed! Please investigate the error messages.

Posted: 20 January 2009 11:26 AM   [ # 7 ]
Sr. Member

Prometheus,

Were you executing the installation as user ‘root’ or a user that had root privileges?  Also, what is the environment in which you are attempting to install (server, OS, RAM, etc.)?

Posted: 20 January 2009 12:52 PM   [ # 8 ]
Newbie

Executing as root. The environment is CentOS, with 4 GB of RAM and an x86 CPU.

Posted: 20 January 2009 01:05 PM   [ # 9 ]
Sr. Member

This might be a silly question, but does the directory /root/download exist?  These are not the default path values so I assume you chose them specifically. 

090120 13:40:38 [ERROR] Brighthouse: Can not access folder /root/download.

Failed on: scripts/mysql_install_db --defaults-file=/etc/my-ib.cnf --user=mysql --basedir=/root/download/infobright-3.0-i686 --datadir=/root/data/ --cachedir=/root/download/infobright-3.0-i686/cache/

If it helps, the Install Guide for ICE can be found here.

Posted: 20 January 2009 01:49 PM   [ # 10 ]
Member

Hi Prometheus

There is a minor bug in our binary around the permission issue you are getting. As a workaround, you can try installing ICE in a non-home folder (i.e. not inside /root). I have seen some community users solve their problem this way.
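The thread does not spell out the root cause. One common reason a separate account such as mysql cannot reach a folder under /root is that some directory on the path is not traversable by other users. As a rough diagnostic under that assumption, the Python sketch below lists the directories on a path that lack the world-execute bit. It ignores group membership and ACLs, and the function name is made up for this example.

```python
import stat
from pathlib import Path
from typing import List


def dirs_blocking_other_users(path: str) -> List[str]:
    """Return `path` and any ancestors that lack the world-execute bit.

    A directory without that bit cannot be traversed by an unrelated
    account. This is a simplification: groups and ACLs are ignored.
    """
    resolved = Path(path).resolve()
    blocking = []
    for directory in [resolved, *resolved.parents]:
        try:
            mode = directory.stat().st_mode
        except FileNotFoundError:
            continue  # not created yet, nothing to check
        except PermissionError:
            blocking.append(str(directory))  # we cannot even inspect it
            continue
        if not mode & stat.S_IXOTH:
            blocking.append(str(directory))
    return blocking


if __name__ == "__main__":
    # The install path from the error log above.
    print(dirs_blocking_other_users("/root/download"))
```

An empty list for the chosen install path suggests that traversal permissions are not the obstacle; a non-empty one points at the directories worth moving away from.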

best regards

Mahib

Posted: 21 January 2009 04:14 PM   [ # 11 ]
Member

Hi Prometheus,

I monitored the ICE/IEE loading speed. The highest figure was 95GB/hour on a multi-core Intel Xeon 3.0GHz machine that also had a SCSI hard drive installed.

My experience tells me:

- Infobright loader does not take much memory, 800K is good enough, regardless of the raw data file size. The loading speed is also consistent for both small and large files.
- Randomness of the raw data is the key factor, because Infobright compresses it and then stores it in a custom binary format. A faster CPU will definitely help.
- You can run multiple loading sessions at the same time, to different tables. The Infobright loader locks the table and blocks other operations on the same table. I tried 2 sessions, but at that point the hard drive became the bottleneck, not Infobright.

I would conclude that you can achieve up to 200GB/hour on a normal Intel server, depending on your data.
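The "one session per table" pattern described above could be driven from a script along these lines. This is a minimal Python sketch, assuming the load is issued as a MySQL-style LOAD DATA INFILE statement with comma-separated fields; the table names, file paths and the `run_sql` callback are placeholders, not anything taken from this thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def build_load_statement(table: str, csv_path: str) -> str:
    """Build a MySQL-style LOAD DATA INFILE statement for one table."""
    return (
        f"LOAD DATA INFILE '{csv_path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ','"
    )


def load_tables_in_parallel(
    files_by_table: Dict[str, str],
    run_sql: Callable[[str], None],
    max_sessions: int = 2,
) -> List[str]:
    """Run one load per table, at most `max_sessions` at a time.

    Each table appears once, so no two sessions compete for the same
    table lock. `run_sql` is supplied by the caller and should open its
    own connection for each statement.
    """
    statements = [build_load_statement(t, p) for t, p in files_by_table.items()]
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        # list() waits for completion and re-raises any load error here.
        list(pool.map(run_sql, statements))
    return statements


if __name__ == "__main__":
    # Stand-in executor that only prints; swap in a real database call.
    load_tables_in_parallel(
        {"calls": "/data/calls.csv", "customers": "/data/customers.csv"},
        run_sql=print,
    )
```

Raising `max_sessions` only helps while the disks can keep up, which is the bottleneck reported above with 2 sessions.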

Thanks

Posted: 21 January 2009 05:05 PM   [ # 12 ]
Newbie

We will run a test on an 8-core Intel machine with 64 GB of RAM this week. If I can load 200 GB of data in 1 hour, we will start to use Infobright. By the way, I assume that query times are in seconds, not minutes :)

I will let you know about the results.

Posted: 22 January 2009 04:52 AM   [ # 13 ]
Super Duper Member

Hi !

Infobright loader does not take much memory, 800K is good enough


An exception is loading long text/char/varchar columns. The compression can take up to 10x more RAM than the source data. For a VARCHAR(10000) column that really does contain long strings, the loader can take a few GB of RAM. But this is a rare case and you do not need to adjust LoaderHeapSize for it. And provided that you have 64GB of RAM, you are unlikely to start swapping.
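A back-of-the-envelope check of the "few GB" figure, as a Python sketch. It assumes the loader compresses 65,536 rows of a column at a time; that row count is my assumption and is not stated in this thread. The 10x multiplier and the VARCHAR(10000) width come from the post above.

```python
ROWS_PER_UNIT = 65_536      # assumption: rows compressed together in one unit
MAX_VARCHAR_BYTES = 10_000  # VARCHAR(10000) with every value at full length
RAM_MULTIPLIER = 10         # "up to 10x" from the post above

source_bytes = ROWS_PER_UNIT * MAX_VARCHAR_BYTES
peak_ram_gb = source_bytes * RAM_MULTIPLIER / 1e9

print(f"source data per unit: {source_bytes / 1e9:.2f} GB")
print(f"worst-case RAM while compressing: {peak_ram_gb:.2f} GB")
```

This is a worst case with every string at full length; typical columns would need far less.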

Posted: 04 February 2009 10:52 AM   [ # 14 ]
Newbie

Hi ,

We ran our test with both ICE and MonetDB last weekend. Same table, same procedure.

We imported 2.5 billion rows into the DB and then ran queries.

MonetDB took 2.25 hours to import all 2.5 billion rows (around 300 GB).

Queries ran in seconds (0.1 sec to 100 sec).

The same data took 3 times longer to import into Infobright ICE.

Queries took 4-5 times longer, sometimes minutes.

MonetDB required more memory, sometimes going into swap.

ICE used very little memory and more CPU. But I expected ICE to be faster, due to its compression.
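Converted to throughput, the import figures above work out as follows. This is a quick Python calculation using the approximate 300 GB size and the "3 times longer" factor reported in this post, so the results are only as precise as those inputs.

```python
DATA_GB = 300         # "around 300 GB"
MONETDB_HOURS = 2.25  # reported MonetDB import time
ICE_SLOWDOWN = 3      # ICE import reported as 3 times longer

monetdb_rate = DATA_GB / MONETDB_HOURS
ice_hours = MONETDB_HOURS * ICE_SLOWDOWN
ice_rate = DATA_GB / ice_hours

print(f"MonetDB: {monetdb_rate:.1f} GB/hour")
print(f"ICE:     {ice_rate:.1f} GB/hour over {ice_hours:.2f} hours")
```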

I would like to get advice here from ICE experts. Maybe I am doing something wrong.

Thanks

Posted: 04 February 2009 11:19 AM   [ # 15 ]
Super Duper Member

Hi !

How much memory have you given to ICE (MainHeapSize)?
Compression can actually slow down queries that need to access all rows: every row must be decompressed, and that takes time.
