Joinutility seperatorLogin utility separator Infobright.com
   
1 of 2
1
BH loader vs Mysql loader
Posted: 16 November 2010 10:41 AM   Ignore ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

This is for IEE version.
I noticed that BH loader is not faster than native MySQL loader if table contains a lot or only varchar columns.
Please see below. Am I missing something?

CREATE TABLE `test` (
`a` varchar(60) DEFAULT NULL,
`b` varchar(60) DEFAULT NULL,
`c` varchar(60) DEFAULT NULL,
`d` varchar(60) DEFAULT NULL
) ENGINE=BRIGHTHOUSE;
Query OK, 0 rows affected (0.01 sec)

mysql> set @bh_dataformat=‘txt_variable’;
Query OK, 0 rows affected (0.00 sec)

mysql> load data infile ‘/tmp/a’ into table test fields terminated by ‘,’;
Query OK, 1000000 rows affected (2.46 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> set @bh_dataformat=‘mysql’;
Query OK, 0 rows affected (0.00 sec)

mysql> load data infile ‘/tmp/a’ into table test fields terminated by ‘,’;
Query OK, 1000000 rows affected (2.57 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

Java code to generate data:
for (int i = 0; i < 1000000; i++) {
  double a = i/0.3; int b = i; String c = String.valueOf(a); String d = String.valueOf(b);
  System.out.println(c + “,” + d + “,” + c + “,” + d);
}

Profile
 
Posted: 16 November 2010 11:11 AM   Ignore ]   [ # 1 ]  
Member
RankRankRank
Total Posts:  223
Joined  2008-08-18

I would say it would be better to try bigger dataset for a fair comparison.
As you loaded by bhloader first, I/O blocks may already have cached for mysql loader because of loading the same file.

You can also clean OS cache but I do not think it completely cleans cache. You need to be root.

sync; echo 1 > /proc/sys/vm/drop_caches; echo 2 > /proc/sys/vm/drop_caches; echo 3 > /proc/sys/vmt/drop_caches
Restart mysqld
run bhloader
sync; echo 1 > /proc/sys/vm/drop_caches; echo 2 > /proc/sys/vm/drop_caches; echo 3 > /proc/sys/vmt/drop_caches
Restart mysqld
run mysql loader

Signature 

Mahib

Profile
 
Posted: 16 November 2010 11:58 AM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

Thanks for reply.
> better to try bigger dataset
I tried (just did not want to make the post too big) to load a table with 60 columns (70% are varchars), 1 million rows with the same effect.
> may already have cached for mysql loader
I do not think it is the case because I tried to switch back and force from bh to mysql and hence the file was (if) cached in both tests.
Moreover I compared ICE and IEE loading the very same file. I was not able to see much difference either while comparison spec says that IEE supports multi-threaded loading and hence should be 5-6 times faster.

Profile
 
Posted: 17 November 2010 09:11 AM   Ignore ]   [ # 3 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  964
Joined  2008-08-18

Hi !

For “only varchar columns” IB loader may not be faster than mysql, because the text compression becomes a bottleneck in both cases.
IEE loader is faster than ICE loader if
- you have a multicore/multi CPU computer
- the loader has been configured (during setup) to use many threads
For the latter see file .infobright in datadir, look for LoaderSaveThreadNumber

Profile
 
Posted: 17 November 2010 09:46 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

Thank you for reply.

> you have a multicore/multi CPU computer
Indeed. I use 24 core Intel(R) Xeon(R) E7450 @ 2.40GHz with 128 Gb or RAM

> look for LoaderSaveThreadNumber
The spec (IEE) says:
With IEE 3.4.2 GA, several tuning parameters are now confgured automatically and have
been deprecated… The deprecated parameters include:
LoaderSaveThreadNumber

Profile
 
Posted: 17 November 2010 12:10 PM   Ignore ]   [ # 5 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  964
Joined  2008-08-18

Hi!
See file .infobright in datadir, look for LoaderSaveThreadNumber to check how it has been configured automatically.

Also, during the load you can use ‘top’ tool to see CPU usage of bhloader process. It should use more than 1 core, which should translate on your system to more than 100/24 %

[ Edited: 17 November 2010 12:21 PM by Janusz Borkowski]
Profile
 
Posted: 17 November 2010 12:39 PM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  738
Joined  2008-08-18

.infobright file can be found in directory defined as ‘basedir’ in /etc/my-ib.cnf, by default the directory that is linked to /usr/local/infobright.

Signature 
Profile
 
Posted: 17 November 2010 12:47 PM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

> See file .infobright
IEE uses dirrefent setup. File is called infobright.ini and it does not reflect all settings. There is a script confman.sh though that shows configuration:
./confman.sh—show-config=yes

Infobright Advanced Configuration Manager.
—————————————————————
Using defaults-file /etc/my-ib.cnf.
Using basedir /export/home/infobright-3.4.2/infobright
Option: brighthouse.version, value: 1.0.0
Option: brighthouse.prefetch.threads, value: 8
Option: brighthouse.prefetch.queuelength, value: 32
Option: brighthouse.prefetch.depth, value: 2
Option: brighthouse.servercompressedheapsize, value: 0.1
Option: brighthouse.loadersavethreadnumber, value: 16
Option: brighthouse.HugefileDir, value:
Option: brighthouse.clustersize, value: 2000
Option: brighthouse.cachinglevel, value: 1
Option: brighthouse.bufferinglevel, value: 2

> on your system to more than 100/24 %
top shows that bhloader is using ~140% of CPU
When I use mysql loader CPU goes up to 230%

Profile
 
Posted: 19 November 2010 12:30 PM   Ignore ]   [ # 8 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello,

We’ve been trying to find out potential reasons for the facts that:

1. MySQL loader takes more CPU than bh

2. bh is not faster than MySQL loader

3. Available CPU is not fully used

The reason may be that the considered varchar values are of relatively simple form (basing on their generator described in your first post). This may mean that compression is too simple to use more CPU. It would also partially explain the reason why MySQL takes more CPU. It needs to do more parsing, which may be visible when compression is relatively simple. For the same kind of reason, given that compression does not take whole CPU and there is enough of it left for parsing, you may not see differences in speed between bh and MySQL loader.

However, the above are just my guesses. I’d have two suggestions:

1. Please try with more complicated varchars, so compression becomes more difficult

2. Please try with more data, so the measurements become (maybe) more reliable

Best greetings,

Dominik

Signature 
Profile
 
Posted: 19 November 2010 12:43 PM   Ignore ]   [ # 9 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

Thank you for reply.
As I mentioned earlier I posted just a simpified example not to polute forum with complex data structures.
In reality I was playing with more complex data, 60 columns with varchars, ints, decimals, date and time. The results were fairly the same.
I’ve sent a sample file (200 Mb) to IEE support team but got no resolution so far. If you have contacts with them you might want to ask for a ticket 1019528
Eugene

Profile
 
Posted: 20 November 2010 03:56 AM   Ignore ]   [ # 10 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Eugene,

Thanks a lot for pointing at the ticket. I studied your discussion with Kethees over there. On the one hand, in the case of 60 columns, the question about difference between ICE and IEE load speed seems to be solved (multi-threading). On the other hand, questions about CPU behavior and MySQL/BH comparison remain open, especially given the fact that you observed it on larger data.

Unfortunately, I can’t download your data sample during the weekend. I need to wait until Monday. I need to check the values in varchar columns. The most important for me right now is whether those values are more like single characters or more like long strings. (This is what I meant by “relatively simple” in my previous post.) So, please give me time to Monday/Tuesday. But if you are able to paste a few rows from your sample in this forum thread before then, I’ll be hopefully able to reply sooner.

Best greetings,

Dominik

Signature 
Profile
 
Posted: 22 November 2010 01:17 PM   Ignore ]   [ # 11 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello again,

I took a look at ddl and sample of data today.

There seem to be relatively many nulls. Also, many varchars are with lookup comment. I think that there is only one varchar column that is not lookup and has relatively complex values. But please correct me if I’m wrong in any of these observations.

Our compression is parallelized roughly as follows: We take a “packrow” (2^16 rows) and parse it into data packs, one for each column. Compression of those data packs can be then done in parallel. Now, imagine that all data packs within a given packrow are compressed except just one of them. I asked my colleague what would happen then. He says that in our current implementation, if this last data pack is an un-lookuped varchar, then we need to wait till its compression is finished before starting compressing data packs from the next packrow. Surely it’s just a hypothesis but it may justify what you see on CPU. Simply, one has to finish job and the others can’t start anything new. On average, it may look as you observed.

We’ll keep improving our loader. grin

Best greetings,

Dominik

Signature 
Profile
 
Posted: 23 November 2010 12:02 PM   Ignore ]   [ # 12 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

Dominik,
Thank you for spending time on that.

> Also, many varchars are with lookup comment
I did that experiment as well. I removed all “lookups”. The loading was about 3% faster.
The comparison table (ICE vs IEE) says that IEE supports multi-threading load up to 300Gb/hr vs ICE’s 55Gb/hr. I cannot achieve such results.

Profile
 
Posted: 23 November 2010 12:12 PM   Ignore ]   [ # 13 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Eugene,

ionna - 23 November 2010 12:02 PM

The comparison table (ICE vs IEE) says that IEE supports multi-threading load up to 300Gb/hr vs ICE’s 55Gb/hr. I cannot achieve such results.

I’m sorry to hear it. I’m not sure what to advise at this stage. These numbers are just an average reference point. Everything depends on data. I hope you’ll get better speed with the upcoming version of ICE/IEE.

ionna - 23 November 2010 12:02 PM

I removed all “lookups”. The loading was about 3% faster.

If the columns have low numbers of distinct values it’s better to keep them as “lookuped”. By the way, did you see any difference in compression ratios when giving up lookups? And how about CPU usage when you give up lookups?

Thanks and greetings!

Dominik

Signature 
Profile
 
Posted: 23 November 2010 12:25 PM   Ignore ]   [ # 14 ]  
Newbie
Rank
Total Posts:  15
Joined  2010-03-30

Dominik,
No, I did not have a look at compression level or CPU usage without “lookups”. In my case it is not that critical. We are working with large “live” data volumes and loading speed is of more importance. Maybe I should try binary loading. By the way, how does binary loading handle errors? If there is an error in ascii file loading I can catch it but I am not sure if it is true for binary.
I have no complains on querying though smile

Profile
 
Posted: 24 November 2010 11:00 AM   Ignore ]   [ # 15 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  487
Joined  2008-08-18

Hello Eugene,

Sorry for delay in replying…

Actually, I’m trying to gether more information about error handling.

(Maybe one of my colleagues will answer to this. Anyway, it will be answered soon.)

Binary load might provide higher speed indeed. It’s always worth trying.

ionna - 23 November 2010 12:25 PM

I have no complains on querying though smile

It’s good to know!!!

Best regards,

Dominik

Signature 
Profile
 
   
1 of 2
1