Joinutility seperatorLogin utility separator Infobright.com
   
1 of 2
1
Compression Ratio increases drastically with export + import?
Posted: 03 August 2010 01:29 PM   Ignore ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Hi there.

I was testing some DDL modification in order to gain more compression / less disk space for some tables we use, and I have seen some behaviour that surprised me.

I was testing adding one tinyint column to see how much disk size penalization we were going to suffer. The table has about 13.798.497 rows and was 288Mb on disk size.
I exported data into a file, then reloaded a tmp table with same DDL definition and disk size was only 57Mb…

Compresion ratio on original table was 2.342, while in the duplicated table was 12.158…

More info:
Original table is loaded through lots of “LOAD DATA INFILE” while application is running.
Duplicate table is loaded in one “LOAD DATA INFILE” (as I exported previously all original data to a single file)

Why is this happening? Does it have to do with load data in small “parts”, so IB engine can not compress so well compared to load data all in once?

I need to understand why there is such a difference, as it would be very interesting to reproduce this kind of compression.

Thanks in advance!!!!

Profile
 
Posted: 03 August 2010 04:18 PM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

Generally speaking loading procedure should not affect compression ratio for the same data. The only case of decreased compression would be for loading data by a series of single inserts in IEE with autocommit on.

Otherwise I can see two reasons:
1. In IEE, there were deleted rows in original table, which was “compacted” by the reloading.
2. The order of rows has changed, resulting in better compression of some columns.

Regards,

Signature 
Profile
 
Posted: 03 August 2010 04:50 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Hi Jakub.

We are using ICE 3.3.1.

The order of the rows… mmm. That could explain it, yes. The point is that I did something like:

SELECT FROM tableA INTO OUTFILE 'tmp.csv';
CREATE TABLE test LIKE tableA;
LOAD DATA INFILE 'tmp.csv' INTO TABLE tableA

Does export/import reorganize data in some way I didn’t expected?
As far as I know, MySQL sorts data by insertion if no index/pk/key given (and there are not in IB). Why should data been reordered on export/import?

I will make some tests tomorrow to try to get profit from row order (and check if that was the reason that explains the better compression)

Thanks a lot!

Profile
 
Posted: 04 August 2010 05:54 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

Reordering is always explicit - the procedure you’re describing should not change row order. So the reasons are still mysterious…


Regards,

Signature 
Profile
 
Posted: 04 August 2010 06:29 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Hi again Jakub, thanks for the replies.
(I was writting down this post by the time of your last answer, and yes, I agree there is some mistery still to find out)

I’ve been researching a bit, and I cannot say I’m convinced of the order of the rows explaining the x5 compression.
I need some help trying to figure out what happened here. I will try to give all the info I’m collecting.
This is the table that I’m testing (I have to mention, that for the other 25 tables in the database, the same behaviour occurs…)

CREATE TABLE `passiveHsdpaRadioParameters_201005` (
  `
idPdpConnectionbigint(20NOT NULL,
  `
timestamptimestamp NOT NULL DEFAULT '1985-01-19 00:00:00',
  `
idUsedLocationint(11NOT NULL,
  `
cqitinyint(4) DEFAULT NULL,
  `
hsDschQam16Packetssmallint(6) DEFAULT NULL,
  `
hsDschQam64Packetssmallint(6) DEFAULT NULL,
  `
hsDschQpskPacketssmallint(6) DEFAULT NULL,
  `
hsDschSchedPacketssmallint(6) DEFAULT NULL,
  `
hsDschNackPacketssmallint(6) DEFAULT NULL,
  `
hsDschTrafficint(11) DEFAULT NULL,
  `
hsDschUplinkTotalmediumint(9) DEFAULT NULL,
  `
numberOfCodestinyint(4) DEFAULT NULL
ENGINE=BRIGHTHOUSE DEFAULT CHARSET=utf8

There is no relation between columns, I mean, if by any reason rows would be sorted by any column, that could improve compression by that column specifically, but not for the other colums… So I’m not sure column order could explain a x5 compression.

shelldu -h passiveHsdpaRadioParameters_201005.bht/
288M    passiveHsdpaRadioParameters_201005.bht/
shellzip -r passiveHsdpaOriginalTable.zip passiveHsdpaRadioParameters_201005.bht/
...
shellls -lh passiveHsdpaOriginalTable.zip 
-rw-r--r-- 1 root root 257M 2010-08-04 12:09 passiveHsdpaOriginalTable.zip 

I’ve zipped folder passiveHsdpaRadioParameters_201005.bht and a little more compression archieved, but nothing compared with export/import.

But if I do what I posted in fist place:

mysqlSET @bh_dataformat 'txt_variable';
Query OK0 rows affected (0.00 sec)

mysqlSELECT count(1FROM passiveHsdpaRadioParameters_201005;
+----------+
count(1) |
+----------+
13798497 |
+----------+
1 row in set (0.00 sec)

mysqlCREATE TABLE passiveHsdpaRadioParameters_201005_reimported LIKE passiveHsdpaRadioParameters_201005;
Query OK0 rows affected (0.01 sec)

mysqlSELECT FROM passiveHsdpaRadioParameters_201005 INTO OUTFILE '/tmp/passiveHsdpaRadioParameters_201005.csv';
Query OK13798497 rows affected (45.69 sec)

LOAD DATA INFILE '/tmp/passiveHsdpaRadioParameters_201005.csv' INTO TABLE passiveHsdpaRadioParameters_201005_reimported;
Query OK13798497 rows affected (49.00 sec)
Records13798497  Deleted0  Skipped0  Warnings

And now let’s look at .bht folder disk size.

shelldu -h passiveHsdpaRadioParameters_201005_reimported.bht/
57M    passiveHsdpaRadioParameters_201005_reimported.bht

From 288M to 57M…

I’m going to test loading table with single LOAD DATA INFILE for each row vs LOAD DATA INFILE for all rows at a time, and see if compression is affected.


I will post the results shortly.

By the way, I could upload the zipped folders for both tables (original and reimported) if it could be helpfull to see what this is happening.

Profile
 
Posted: 04 August 2010 07:17 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

Would you, please, check which files changed their size inside the *.bht folders for both cases?

Regards,

Signature 
Profile
 
Posted: 04 August 2010 07:50 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

The differences I can observe:
54K   TA00000000000001.ctb <——————Originally 7.2M
1.4M TA00001000000001.ctb <——————Originally 14M
181K TA00002000000001.ctb <——————Originally 8.3M
2.9M TA00003000000001.ctb <——————Originally 6.0M
4.0M TA00004000000001.ctb <——————Originally 25M
9.0M TA00006000000001.ctb <——————Originally 9.4M
9.5M TA00007000000001.ctb <——————Originally 9.7M
3.9M TA00008000000001.ctb <——————Originally 24M
20M TA00009000000001.ctb <——————Originally 23M
2.3M TA00010000000001.ctb <——————Originally 15M
4.0M TA00011000000001.ctb <——————Originally 5.7M
Table.ctb <———————————- Originally 91

And a big difference is that in the original table folder (the one not reimported), there are the same files as the list above, but ending with 0 and with a similar size:

7.1M TA00000000000000.ctb
14M TA00001000000000.ctb
8.3M TA00002000000000.ctb
5.9M TA00003000000000.ctb
25M TA00004000000000.ctb
9.1M TA00006000000000.ctb
9.7M TA00007000000000.ctb
23M TA00008000000000.ctb
23M TA00009000000000.ctb
14M TA00010000000000.ctb
5.7M TA00011000000000.ctb

This files does not appear at folder of reimported table.

Profile
 
Posted: 04 August 2010 11:10 AM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Results from loading table with LOAD DATA INFILE for each row vs LOAD DATA INFILE for all rows in a table:

I used previous table described above: passiveHsdpaRadioParameters_201005

mysqlSELECT FROM passiveHsdpaRadioParameters_201005 LIMIT 1000 INTO OUTFILE '/tmp/testPassiveHsdpaRadioParametersLOAD.csv';
Query OK1000 rows affected (0.01 sec)
mysqlCREATE TABLE loadOnce LIKE passiveHsdpaRadioParameters_201005;
Query OK0 rows affected (0.04 sec)
mysqlLOAD DATA INFILE '/tmp/testPassiveHsdpaRadioParametersLOAD.csv' INTO TABLE loadOnce;
Query OK1000 rows affected (0.15 sec)
Records1000  Deleted0  Skipped0  Warnings

Let’s see it’s disk size:

shelldu -h loadOnce.bht/
196K    loadOnce.bht

I’ve created a php script to create a table like original, and for each row: SELECT INTO OUTFILE + LOAD DATA INFILE + rm file.
Results for this table are just… intriguing

After execute php script, table loadOnce_reimported has been recreated and filled with same data. If we look at disk size…

shell>du -h loadOnce_reimported.bht/
364K    loadOnce_reimported.bht


Table loaded with 1000 rows in a single LOAD DATA INFILE:
196K
Table loaded with 1000 rows, one row in a single LOAD DATA INFILE:
364K

In folders, the difference is the same as in previous post:

All files named like TA0000X000000001.ctb are smaller in the table loadad in a single LOAD DATA INFILE.
All files named like TA0000X000000000.ctb does not exists in the table loaded in a single LOAD DATA INFILE although exists at table loaded with LOAD DATA INFILE for each row, with similar size than TA0000X000000001.ctb..

What’s in this files? I suposed this were the data packs… Why loading row by row creates more files than loading all rows in a single LOAD DATA?

File Attachments 
secondsSizeIncrease.txt  (File Size: 2KB - Downloads: 246)
Profile
 
Posted: 04 August 2010 11:21 AM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Sorry, I attached wrong file… this is the php script. If needed, just replace define lines at the beggining of the script and should work properly

<?php
define
("SERVER","");
define("USER","");
define("PASS","");
define("DB","");
define("TABLE","");

function 
connectDB($host,$user,$pass,$db NULL){
    $dbLink 
mysql_connect($host,$user,$pass);
    if(!
$dbLink){
        
die("Error connecting to db server.\n".mysql_error($dbLink)."\n");
    
}
    
if($db != NULL){
        
if(!mysql_select_db("shard_nxray_loadTest_1",$dbLink)){
            
die("Error selecting db $db\n".mysql_error($dbLink)."\n");
        
}
    }
    
return $dbLink;
}

function executeQuery($query,$dbLink){
    $result 
mysql_query($query,$dbLink);
    if(!
$result){
        
die("Error executing query: $query\n.".mysql_error($dbLink)."\n");
    
}
    
return $result;
}
//connect to db
$dbLink connectDB(SERVER,USER,PASS,DB);
// get rows from table
$query "SELECT COUNT(*) AS numRows FROM ".TABLE;
$result executeQuery($query,$dbLink);
$row=mysql_fetch_array($result);
$numRows $row["numRows"];
// create table like testing table
$query "CREATE TABLE ".TABLE."_reimported LIKE ".TABLE;
executeQuery($query,$dbLink);
$query "SET @bh_dataformat='txt_variable'";
executeQuery($query,$dbLink);
// select data into numRows files (1 row per file)
for($i=0;$i<$numRows;$i++){
    $query 
"SELECT * FROM ".TABLE." LIMIT $i,1 INTO OUTFILE '/tmp/tmp_$i'";
    
executeQuery($query,$dbLink);
       
    
$query "LOAD DATA INFILE '/tmp/tmp_$i' INTO TABLE ".TABLE."_reimported";
    
executeQuery($query,$dbLink);
    
    
shell_exec("rm /tmp/tmp_$i");
}
mysql_close
($dbLink);
?> 
Profile
 
Posted: 04 August 2010 11:22 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

The behaviour you’ve described in the last post is normal. The overhead of one-row inserting vs. bulk load is a double disk space for the last data pack (in your case - the only data pack). It means that the difference will be 100% in all cases there are less than 65536 rows. It is caused by a mechanism preventing data packs to be overwritten by a newer version, to allow transaction rollback.

However, the difference should be proportionally smaller for tables with more than one data pack. In your original problem there is more than 200 data packs loaded, so the relative difference between these two scenarios should be not 100% (or 500%, as you reported), but about 0.5%. If it is larger, it may be a bug.

I suggest to repeat your experiment with more rows (e.g. a million).

Regards,

Signature 
Profile
 
Posted: 04 August 2010 11:35 AM   Ignore ]   [ # 10 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Well, I posted this results because I was testing the original table (13.000.000+ rows) and after 1h 30m disk size was about 1.1G (originally 288M and after export+import 57M) for only the first 50.000 rows!

I stopped the test and reproduce it with just 1000 rows.

I have no problem with repeat the test with, e.g. a million rows. I will do it right now, but I can almost ensure results will be worst than my first post.

Thanks again. I will post the results when I got them.

Profile
 
Posted: 04 August 2010 04:13 PM   Ignore ]   [ # 11 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Results for testing LOAD DATA INFILE row by row 1.000.000 times vs LOAD DATA INFILE 1.000.000 rows at once:


Get first table, with 1.000.000 rows loaded in a single LOAD DATA INFILE.

mysqlCREATE TABLE loadOnce_oneMillion LIKE passiveHsdpaRadioParameters_201005;
Query OK0 rows affected (0.00 sec)

mysqlSELECT FROM passiveHsdpaRadioParameters_201005 LIMIT 1000000 INTO OUTFILE '/tmp/pasHsdpRad_oneMillionRows.csv';
Query OK1000000 rows affected (3.97 sec)

mysqlLOAD DATA INFILE '/tmp/pasHsdpRad_oneMillionRows.csv' INTO TABLE loadOnce_oneMillion;
Query OK1000000 rows affected (4.88 sec)
Records1000000  Deleted0  Skipped0  Warnings

disk size / files in folder:

shell>  du -h loadOnce_oneMillion.bht/
4.4M    loadOnce_oneMillion.bht/

shell>ls -lh loadOnce_oneMillion.bht/
total 4.4M
-rw-rw---- 1 mysql mysql 5.5K 2010-08-04 17:41 TA00000000000001.ctb
-rw-rw---- 1 mysql mysql  130 2010-08-04 17:41 TA00000.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00000DPN.ctb
-rw-rw---- 1 mysql mysql  96K 2010-08-04 17:41 TA00001000000001.ctb
-rw-rw---- 1 mysql mysql  124 2010-08-04 17:41 TA00001.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00001DPN.ctb
-rw-rw---- 1 mysql mysql  16K 2010-08-04 17:41 TA00002000000001.ctb
-rw-rw---- 1 mysql mysql  129 2010-08-04 17:41 TA00002.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00002DPN.ctb
-rw-rw---- 1 mysql mysql 220K 2010-08-04 17:41 TA00003000000001.ctb
-rw-rw---- 1 mysql mysql  118 2010-08-04 17:41 TA00003.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00003DPN.ctb
-rw-rw---- 1 mysql mysql 335K 2010-08-04 17:41 TA00004000000001.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:41 TA00004.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00004DPN.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:41 TA00005.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00005DPN.ctb
-rw-rw---- 1 mysql mysql 674K 2010-08-04 17:41 TA00006000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:41 TA00006.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00006DPN.ctb
-rw-rw---- 1 mysql mysql 721K 2010-08-04 17:41 TA00007000000001.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:41 TA00007.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00007DPN.ctb
-rw-rw---- 1 mysql mysql 313K 2010-08-04 17:41 TA00008000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:41 TA00008.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00008DPN.ctb
-rw-rw---- 1 mysql mysql 1.5M 2010-08-04 17:41 TA00009000000001.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 17:41 TA00009.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00009DPN.ctb
-rw-rw---- 1 mysql mysql 168K 2010-08-04 17:41 TA00010000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:41 TA00010.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00010DPN.ctb
-rw-rw---- 1 mysql mysql 289K 2010-08-04 17:41 TA00011000000001.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 17:41 TA00011.ctb
-rw-rw---- 1 mysql mysql  629 2010-08-04 17:41 TA00011DPN.ctb
-rw-rw---- 1 mysql mysql   76 2010-08-04 17:37 Table.ctb
-rw-rw---- 1 mysql mysql  130 2010-08-04 17:37 TB00000.ctb
-rw-rw---- 1 mysql mysql  124 2010-08-04 17:37 TB00001.ctb
-rw-rw---- 1 mysql mysql  129 2010-08-04 17:37 TB00002.ctb
-rw-rw---- 1 mysql mysql  118 2010-08-04 17:37 TB00003.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:37 TB00004.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:37 TB00005.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:37 TB00006.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 17:37 TB00007.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:37 TB00008.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 17:37 TB00009.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 17:37 TB00010.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 17:37 TB00011.ctb 

Get second table, with 1.000.000 rows loaded in 1.000.000 LOAD DATA INFILE queries (1 query per row in table).

Ok, I have stopped the test, as it has been running for 4+ hours now, and the results show there is something wrong.

Executed the php code from above over table loadOnce_oneMillion.
Script creates table loadOnce_oneMillion_reimported and starts selecting 1 row into outfile from loadOnce_oneMillion, and load that file into loadOnce_oneMillion_reimported.

mysqlselect count(1FROM loadOnce_oneMillion_reimported;
+----------+
count(1) |
+----------+
|   
101636 |
+----------+
1 row in set (0.00 sec

And size/files for table:

shelldu -h loadOnce_oneMillion_reimported.bht/
2.5G    loadOnce_oneMillion_reimported.bht/

shell>ls -lh loadOnce_oneMillion_reimported.bht/
total 2.5G
-rw-rw---- 1 mysql mysql    0 2010-08-04 21:14 ab_switch
-rw-rw---- 1 mysql mysql  19M 2010-08-04 21:14 TA00000000000000.ctb
-rw-rw---- 1 mysql mysql  19M 2010-08-04 21:14 TA00000000000001.ctb
-rw-rw---- 1 mysql mysql  130 2010-08-04 21:14 TA00000.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00000DPN.ctb
-rw-rw---- 1 mysql mysql  71M 2010-08-04 21:14 TA00001000000000.ctb
-rw-rw---- 1 mysql mysql  71M 2010-08-04 21:14 TA00001000000001.ctb
-rw-rw---- 1 mysql mysql  124 2010-08-04 21:14 TA00001.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00001DPN.ctb
-rw-rw---- 1 mysql mysql  24M 2010-08-04 21:14 TA00002000000000.ctb
-rw-rw---- 1 mysql mysql  24M 2010-08-04 21:14 TA00002000000001.ctb
-rw-rw---- 1 mysql mysql  129 2010-08-04 21:14 TA00002.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00002DPN.ctb
-rw-rw---- 1 mysql mysql 109M 2010-08-04 21:14 TA00003000000000.ctb
-rw-rw---- 1 mysql mysql 109M 2010-08-04 21:14 TA00003000000001.ctb
-rw-rw---- 1 mysql mysql  118 2010-08-04 21:14 TA00003.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00003DPN.ctb
-rw-rw---- 1 mysql mysql 141M 2010-08-04 21:14 TA00004000000000.ctb
-rw-rw---- 1 mysql mysql 140M 2010-08-04 21:14 TA00004000000001.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TA00004.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00004DPN.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TA00005.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00005DPN.ctb
-rw-rw---- 1 mysql mysql 171M 2010-08-04 21:14 TA00006000000000.ctb
-rw-rw---- 1 mysql mysql 171M 2010-08-04 21:14 TA00006000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TA00006.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00006DPN.ctb
-rw-rw---- 1 mysql mysql 182M 2010-08-04 21:14 TA00007000000000.ctb
-rw-rw---- 1 mysql mysql 180M 2010-08-04 21:14 TA00007000000001.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TA00007.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00007DPN.ctb
-rw-rw---- 1 mysql mysql 104M 2010-08-04 21:14 TA00008000000000.ctb
-rw-rw---- 1 mysql mysql 104M 2010-08-04 21:14 TA00008000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TA00008.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00008DPN.ctb
-rw-rw---- 1 mysql mysql 198M 2010-08-04 21:14 TA00009000000000.ctb
-rw-rw---- 1 mysql mysql 195M 2010-08-04 21:14 TA00009000000001.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 21:14 TA00009.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00009DPN.ctb
-rw-rw---- 1 mysql mysql  94M 2010-08-04 21:14 TA00010000000000.ctb
-rw-rw---- 1 mysql mysql  93M 2010-08-04 21:14 TA00010000000001.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TA00010.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00010DPN.ctb
-rw-rw---- 1 mysql mysql 134M 2010-08-04 21:14 TA00011000000000.ctb
-rw-rw---- 1 mysql mysql 134M 2010-08-04 21:14 TA00011000000001.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 21:14 TA00011.ctb
-rw-rw---- 1 mysql mysql  111 2010-08-04 21:14 TA00011DPN.ctb
-rw-rw---- 1 mysql mysql   87 2010-08-04 17:46 Table.ctb
-rw-rw---- 1 mysql mysql  130 2010-08-04 21:14 TB00000.ctb
-rw-rw---- 1 mysql mysql  124 2010-08-04 21:14 TB00001.ctb
-rw-rw---- 1 mysql mysql  129 2010-08-04 21:14 TB00002.ctb
-rw-rw---- 1 mysql mysql  118 2010-08-04 21:14 TB00003.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TB00004.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TB00005.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TB00006.ctb
-rw-rw---- 1 mysql mysql  133 2010-08-04 21:14 TB00007.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TB00008.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 21:14 TB00009.ctb
-rw-rw---- 1 mysql mysql  132 2010-08-04 21:14 TB00010.ctb
-rw-rw---- 1 mysql mysql  128 2010-08-04 21:14 TB00011.ctb 

 

Sum up:

1st.
Exported 1.000.000 to a new file from original table.
Create tmp table like original and loaded 1.000.000 rows with load data infile.
Load data took 4.88 sec and disk size was 4.4M

2nd.
Executed php script to create table like tmp, select row by row into new file + load each file into new tmp table.

Load data took 4+ hours and only 100.000+ rows were loaded.
Disk space for those 100.000+ rows is 2.5G


I have attached .csv file with 100.000 rows, in case you want to test it by yourself (if you need a bigger one, I won’t have any problem uploading it as data in csv is not really important wink )
You have the DDL for the table, the csv file and the php script (I’m sure you won’t need that last one wink ) to reproduce the problem. I cannot say really what is happening.
I will be glad to test anything else you can tell or change my experiments to fit anything you need to find out what is happening.

I think this is not “normal” behaviour (adding more disk space after each load data infile).

Thanks a lot

[ Edited: 05 August 2010 05:29 AM by Nadir]
File Attachments 
100000_rows.csv.zip  (File Size: 936KB - Downloads: 252)
Profile
 
Posted: 06 August 2010 10:20 AM   Ignore ]   [ # 12 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Hi again.

I know there are a lot of posts to reply and lot of work to do wink , so I was thinking on posting it into the bug list if there is no explanation for it yet (and it looks like the behaviour is not expected).

We can think in a couple of workarounds for this, but as we are not really sure what is really happening, our workarounds could just not do what we expect.
We assume there is some ¿“duplication”? of data when executing “load data infile” so if we reduce the number of queries to load data, the less penalty we will suffer from this ¿“bug”?

Please correct me if I’m wrong (obviously, if there is an explanation), as we could workaround this wrongly and get unexpected results… wink
I just need some guidance in how to workaround this, as it seems that we can get much more profit from IB compression than we do right now.

Thanks in advance!!!

Profile
 
Posted: 07 August 2010 03:38 AM   Ignore ]   [ # 13 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  763
Joined  2008-08-18

Hi,

We’re working on it (which means it’s in our queue). As a workaround, guessing a reason, I would suggest to load data in larger portions (>65536 rows), even if it means keeping the data in a temporary buffer/table for some time. Or to export/import the whole table from time to time (i.e. when the disk overhead starts to be an issue).
I would also check the latest ICE version whether the bug still occurs.

Regards,

Signature 
Profile
 
Posted: 29 November 2011 09:19 PM   Ignore ]   [ # 14 ]  
Newbie
Rank
Total Posts:  8
Joined  2011-07-14

Hi Jakub,

We are using infobright 4.0.3 and are facing similar issue as described in this post. As of now, we are doing export/import of whole database periodically to keep the data size in check. Can you please let us know the status of this issue and when can we expect this to be resolved?

Thanks,
Puneet

Profile
 
Posted: 30 November 2011 05:27 AM   Ignore ]   [ # 15 ]  
Newbie
Rank
Total Posts:  16
Joined  2009-11-09

Hi Puneet.

After I posted this thread on the forum, I posted its related bug on the buglist:
http://bugs.infobright.org/ticket/1916

Although it is still on the rally and there has been no activity in the past 9 months…

As commented in the bug post, it seems that a single LOAD DATA creates a data pack for data loaded, so if you are inserting few rows on each LOAD DATA query, you are “wasting” the rest of the 65K rows that fits on a single data pack… I assume overhead of data pack is what causes disk size increasing too high with few rows. And of course, that if every few rows are being loaded into separated data packs engine is unable to compress them together (I think compression is made on each data pack).

I’m sure Januz will shed some light to this wink

Assuming there has been no activity in the past few months, I’m not waiting Infobright team to resolve this soon…

Good Luck!
Nadir

[ Edited: 30 November 2011 05:32 AM by Nadir]
Profile
 
   
1 of 2
1