Joinutility seperatorLogin utility separator Infobright.com
   
 
data load vs. query performance
Posted: 17 June 2010 09:16 AM   Ignore ]  
Newbie
Avatar
Rank
Total Posts:  24
Joined  2010-03-29

Hi !

I just installed ICE and am doing some testing. My testing data is two sets of 25 files. Each file has +/- 1 million line. Each 25-file set in loaded in one table. So in the end, I have two tables, each with +/- 25 million rows. I almost have a 1-1 relationship between both tables but not quite. Otherwise, I would load all data in a single table.

I can load the data in two different ways. 1) I call the bulk loader 25 times for each table or 2) I combine all 25 files into 1 big file and call the bulk loader once for each table.

Here is the problem: query performance is dramatically worse in scenario 1 than in scenario 2. Why ?

Possible solutions would be : a) always “pool” data together before loading. However, this is not always possible. b) load data in chunks and then “optimize” the table. However, I could not find any optimizing tools for engine=brighthouse tables. c) another solution unknown to me as of yet.

I searched the different ICE forums, but could not find a definitive answer to this matter.

Any and all help highly appreciated !

Martin.

PS If relevant: testing data set is 25 million rows. Production data sets will be in billions of rows (notice “data sets” is plural).

Profile
 
Posted: 17 June 2010 09:40 AM   Ignore ]   [ # 1 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  912
Joined  2008-08-18

Hi !

There has never been a problem with loading data in chunks. I mean your report is the first one suggesting there is a different performance for different load scenarios, so it is difficult to tell what is going on.

Are you running queries and load at the same time?

And as usually, when asking about a performance problem - could you provide bh.err file for a sample query executed for (after) both load scenarios ?

J.

Profile
 
Posted: 17 June 2010 11:14 AM   Ignore ]   [ # 2 ]  
Newbie
Avatar
Rank
Total Posts:  24
Joined  2010-03-29

Hi !

Thanks for the quick response. Just to make sure, I tested both scenarios again from scratch. For simple queries, indeed, no significant differences. But for more complex queries, there is a huge difference: 20 seconds vs. 265 seconds in one instance !

Here is the content of bh.err, but there is nothing about the load or the queries. Must I turn some loggin option on ?

Loading configuration for Infobright instance ...
Option: AllowMySQLQueryPath, value: 0.
Option: BufferingLevel, value: 2.
Option: CacheFolder, value: C:/Program Files/Infobright/cache.
Option: CachingLevel, value: 1.
Option: ClusterSize, value: 2000.
Option: ControlMessages, value: 0.
Option: HugefileDir, value: .
Option: InternalMessages, value: 0.
Option: KNFolder, value: BH_RSI_Repository.
Option: KNLevel, value: 99.
Option: LicenseFile, value: <unknown>.
Option: LoaderMainHeapSize, value: 200.
Option: LoaderSaveThreadNumber, value: 16.
Option: PushDown, value: 1.
Option: ServerCompressedHeapSize, value: 100.
Option: ServerMainHeapSize, value: 800.
Option: UseCharset, value: 0.
Option: UseMySQLImportExportDefaults, value: 0.
Infobright instance configuration loaded.
100616 11:59:11 [Note] Event Scheduler: Loaded 0 events
100616 11:59:11 [Note] Infobright Community Edition 3.3.2 is ready for connections.
socket: ‘’ port: 5029 build number (revision)=IB_3.3.2_r7501_7556(ice)

As I mentionned earlier, I could not find a definitive reason for this performance discrepency, but according to this thread, it seems plausible that loading in chunks impacts performance. However, I guess that the loader should be able to reoptimize the data, should it not ?

Monitoring the server status while queries are executing, I noticed that the 265-seconds-query spends most of its time in the “optimizing” state. I ran it 3 times in a row, with similar timing result (265, 223, 220 seconds, respectively). This query has 4 joins: it joins twice on each of my 2 tables. Some other queries that perform the same in both scenarios also have joins, but this one is the only one with multiple joins on the same tables.

I run Infobright on my windows XP Pro 32 desktop and access my db through MySQL Workbench 5.2.18.

Profile
 
Posted: 17 June 2010 11:31 AM   Ignore ]   [ # 3 ]  
Newbie
Rank
Total Posts:  25
Joined  2009-08-06

Hi Martin,

If you set ControlMessages=2 in your brighthouse.ini then more information about the queries and their execution will be logged in bh.err.

Shawn

Profile
 
Posted: 17 June 2010 11:31 AM   Ignore ]   [ # 4 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  912
Joined  2008-08-18

Hi !

To have meaningful bh.err set ControlMesasges = 2, see http://www.infobright.org/wiki/Query_Execution_Log/ . Also there are several forum threads touching this.

In the bh.err you will get query log. The idea is to submit a log from a query run after a single large load and the same query run after 25 smaller loads.

it seems plausible that loading in chunks impacts performance

This would be a new and important discovery…

J.

[ Edited: 17 June 2010 11:59 AM by Janusz Borkowski]
Profile
 
Posted: 17 June 2010 04:19 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  735
Joined  2008-08-18

Hi,

A short comment not related directly to the problem:
For the best performance (and in a case of billions of rows it may mean “the only acceptable”) it is recommended to use 64-bit version of ICE and set memory in brighthouse.ini to some higher values. E.g. “ServerMainHeapSize = 2500” for 4 GB machine. Value 800 is quite low for performance-oriented tests.

Regards,

Signature 
Profile
 
Posted: 21 June 2010 10:30 AM   Ignore ]   [ # 6 ]  
Newbie
Avatar
Rank
Total Posts:  24
Joined  2010-03-29

Hello again !

Ok, ControlMessages is now 2. I have attached two log files, one for the query running fast and one for the query running slow. The fast query is the one performed on the tables loaded from a single file per table. The slow query is the one performed on identical tables but loaded for 25 files each. The problem seems definitely to be around :

2010-06-21 09:57:30 [4] Sorting roughly multiindex... 

in the slow query. What should I make of this ? Also, number of tuples produced is very different between both queries. What does that mean ?

Again, thanks for you help.

File Attachments 
fast_log.txt  (File Size: 4KB - Downloads: 238)
slow_log.txt  (File Size: 4KB - Downloads: 237)
Profile
 
Posted: 21 June 2010 01:50 PM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  735
Joined  2008-08-18

Hi,

The logs show two things:
1. KNs are different for both cases. Are the rows loaded in exactly the same order in both cases? I.e. a concatenation of the 25 smaller files (in the order of their loading) is equal to the large one?
2. There is a difference in join order in both cases, and this is the actual reason of performance difference. It may be a bug in optimizer.

Regards,

Signature 
Profile
 
Posted: 21 June 2010 04:15 PM   Ignore ]   [ # 8 ]  
Newbie
Avatar
Rank
Total Posts:  24
Joined  2010-03-29

Hi !

In the initial tests, rows were not loaded in the same order. Tests were performed again, this time loading rows in the same order. Same timing results were obtained. This is not surprising to me, since the “where” clause of the problematic query specifies data contained in only one of the 25 chunks. In other words, I could load only one of the 25 files to execute the query and get the same result set.

In this case, I could load data in the exact same order for both scenarios. But this will not always be possible.

Cheers,

M.

Profile
 
Posted: 22 June 2010 03:33 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  735
Joined  2008-08-18

Hi,

The order of loading data is very important for performance. If possible, data should be ordered (or nearly ordered) by the most limiting WHERE condition or by the joining key.

Regards,

Signature 
Profile
 
Posted: 22 June 2010 03:46 AM   Ignore ]   [ # 10 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  912
Joined  2008-08-18

hi !

2. There is a difference in join order in both cases, and this is the actual reason of performance difference. It may be a bug in optimizer.

It is not easy to reproduce this bug without sample data. Fortunately, there is a big chance it has been already fixed an the new ICE release (in July) will work fine.

Profile
 
Posted: 22 June 2010 08:31 AM   Ignore ]   [ # 11 ]  
Newbie
Avatar
Rank
Total Posts:  24
Joined  2010-03-29

Hi !

If you so wish, and we can arrange for transfer, I can provide the data set used for testing (955MB). Is the new release for beginning or end of July ?

Cheers,

M.

Profile