This is for IEE version.
I noticed that BH loader is not faster than native MySQL loader if table contains a lot or only varchar columns.
Please see below. Am I missing something?
mysql> set @bh_dataformat=‘txt_variable’;
Query OK, 0 rows affected (0.00 sec)
mysql> load data infile ‘/tmp/a’ into table test fields terminated by ‘,’;
Query OK, 1000000 rows affected (2.46 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0
mysql> set @bh_dataformat=‘mysql’;
Query OK, 0 rows affected (0.00 sec)
mysql> load data infile ‘/tmp/a’ into table test fields terminated by ‘,’;
Query OK, 1000000 rows affected (2.57 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0
Java code to generate data:
for (int i = 0; i < 1000000; i++) {
double a = i/0.3; int b = i; String c = String.valueOf(a); String d = String.valueOf(b);
System.out.println(c + “,” + d + “,” + c + “,” + d);
}
I would say it would be better to try bigger dataset for a fair comparison.
As you loaded by bhloader first, I/O blocks may already have cached for mysql loader because of loading the same file.
You can also clean OS cache but I do not think it completely cleans cache. You need to be root.
Thanks for reply.
> better to try bigger dataset
I tried (just did not want to make the post too big) to load a table with 60 columns (70% are varchars), 1 million rows with the same effect.
> may already have cached for mysql loader
I do not think it is the case because I tried to switch back and force from bh to mysql and hence the file was (if) cached in both tests.
Moreover I compared ICE and IEE loading the very same file. I was not able to see much difference either while comparison spec says that IEE supports multi-threaded loading and hence should be 5-6 times faster.
For “only varchar columns” IB loader may not be faster than mysql, because the text compression becomes a bottleneck in both cases.
IEE loader is faster than ICE loader if
- you have a multicore/multi CPU computer
- the loader has been configured (during setup) to use many threads
For the latter see file .infobright in datadir, look for LoaderSaveThreadNumber
> you have a multicore/multi CPU computer
Indeed. I use 24 core Intel(R) Xeon(R) E7450 @ 2.40GHz with 128 Gb or RAM
> look for LoaderSaveThreadNumber
The spec (IEE) says:
With IEE 3.4.2 GA, several tuning parameters are now confgured automatically and have
been deprecated… The deprecated parameters include: LoaderSaveThreadNumber
Hi!
See file .infobright in datadir, look for LoaderSaveThreadNumber to check how it has been configured automatically.
Also, during the load you can use ‘top’ tool to see CPU usage of bhloader process. It should use more than 1 core, which should translate on your system to more than 100/24 %
> See file .infobright
IEE uses dirrefent setup. File is called infobright.ini and it does not reflect all settings. There is a script confman.sh though that shows configuration:
./confman.sh—show-config=yes
We’ve been trying to find out potential reasons for the facts that:
1. MySQL loader takes more CPU than bh
2. bh is not faster than MySQL loader
3. Available CPU is not fully used
The reason may be that the considered varchar values are of relatively simple form (basing on their generator described in your first post). This may mean that compression is too simple to use more CPU. It would also partially explain the reason why MySQL takes more CPU. It needs to do more parsing, which may be visible when compression is relatively simple. For the same kind of reason, given that compression does not take whole CPU and there is enough of it left for parsing, you may not see differences in speed between bh and MySQL loader.
However, the above are just my guesses. I’d have two suggestions:
1. Please try with more complicated varchars, so compression becomes more difficult
2. Please try with more data, so the measurements become (maybe) more reliable
Thank you for reply.
As I mentioned earlier I posted just a simpified example not to polute forum with complex data structures.
In reality I was playing with more complex data, 60 columns with varchars, ints, decimals, date and time. The results were fairly the same.
I’ve sent a sample file (200 Mb) to IEE support team but got no resolution so far. If you have contacts with them you might want to ask for a ticket 1019528
Eugene
Thanks a lot for pointing at the ticket. I studied your discussion with Kethees over there. On the one hand, in the case of 60 columns, the question about difference between ICE and IEE load speed seems to be solved (multi-threading). On the other hand, questions about CPU behavior and MySQL/BH comparison remain open, especially given the fact that you observed it on larger data.
Unfortunately, I can’t download your data sample during the weekend. I need to wait until Monday. I need to check the values in varchar columns. The most important for me right now is whether those values are more like single characters or more like long strings. (This is what I meant by “relatively simple” in my previous post.) So, please give me time to Monday/Tuesday. But if you are able to paste a few rows from your sample in this forum thread before then, I’ll be hopefully able to reply sooner.
There seem to be relatively many nulls. Also, many varchars are with lookup comment. I think that there is only one varchar column that is not lookup and has relatively complex values. But please correct me if I’m wrong in any of these observations.
Our compression is parallelized roughly as follows: We take a “packrow” (2^16 rows) and parse it into data packs, one for each column. Compression of those data packs can be then done in parallel. Now, imagine that all data packs within a given packrow are compressed except just one of them. I asked my colleague what would happen then. He says that in our current implementation, if this last data pack is an un-lookuped varchar, then we need to wait till its compression is finished before starting compressing data packs from the next packrow. Surely it’s just a hypothesis but it may justify what you see on CPU. Simply, one has to finish job and the others can’t start anything new. On average, it may look as you observed.
> Also, many varchars are with lookup comment
I did that experiment as well. I removed all “lookups”. The loading was about 3% faster.
The comparison table (ICE vs IEE) says that IEE supports multi-threading load up to 300Gb/hr vs ICE’s 55Gb/hr. I cannot achieve such results.
The comparison table (ICE vs IEE) says that IEE supports multi-threading load up to 300Gb/hr vs ICE’s 55Gb/hr. I cannot achieve such results.
I’m sorry to hear it. I’m not sure what to advise at this stage. These numbers are just an average reference point. Everything depends on data. I hope you’ll get better speed with the upcoming version of ICE/IEE.
ionna - 23 November 2010 12:02 PM
I removed all “lookups”. The loading was about 3% faster.
If the columns have low numbers of distinct values it’s better to keep them as “lookuped”. By the way, did you see any difference in compression ratios when giving up lookups? And how about CPU usage when you give up lookups?
Dominik,
No, I did not have a look at compression level or CPU usage without “lookups”. In my case it is not that critical. We are working with large “live” data volumes and loading speed is of more importance. Maybe I should try binary loading. By the way, how does binary loading handle errors? If there is an error in ascii file loading I can catch it but I am not sure if it is true for binary.
I have no complains on querying though