Joinutility seperatorLogin utility separator Infobright.com
   
1 of 2
1
Same ‘group by’ query with large unsorted data set produces wrong result.
Posted: 02 July 2010 11:03 AM   Ignore ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

HI,

I am running a group by query with grouping on multiple parameters.

When this query runs on a large sorted dataset of around 30 million records, it works fine; but when the same same query runs on an unsorted dataset, it gives wrong result for group by.

What could be the possible reason for this. Can it be a memory issue or it has to do something with infobright configuration.

Kindly let me know your inputs.

Thanks,
Manu

Profile
 
Posted: 02 July 2010 02:53 PM   Ignore ]   [ # 1 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

Hi,

Please find below the execution log for the same query, being run on different tables. In one of the tables sorted data is loaded and in another unsorted (before loading the data to one table, the file was initially sorted and then data was loaded and in another table, the file was loaded as such,unsorted).

Execution log, for the table with sorted data set,

2010-07-02 14:39:16 [3] T:-1 = TABLE_ALIAS(T:0,“trades_temp”)
T:-2 = TMP_TABLE(T:-1)
VC:-2.0 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:0))
A:-1 = T:-2.ADD_COLUMN(VC:-2.0,GROUP_BY,“symbol”,“ALL”)
VC:-2.1 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:1))
A:-2 = T:-2.ADD_COLUMN(VC:-2.1,GROUP_BY,“date”,“ALL”)
VC:-2.2 = CREATE_VC(T:-2,EXPR(“floor”))
A:-3 = T:-2.ADD_COLUMN(VC:-2.2,GROUP_BY,“bucket”,“ALL”)
A:-4 = T:-2.ADD_COLUMN(<null>,COUNT,“trades”,“ALL”)
VC:-2.3 = CREATE_VC(T:-2,EXPR(”*”))
A:-5 = T:-2.ADD_COLUMN(VC:-2.3,SUM,“null”,“ALL”)
VC:-2.4 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:5))
A:-6 = T:-2.ADD_COLUMN(VC:-2.4,SUM,“null”,“ALL”)
VC:-2.5 = CREATE_VC(T:-2,EXPR(“round”))
A:-7 = T:-2.ADD_COLUMN(VC:-2.5,DELAYED,“vwap”,“ALL”)
A:-8 = T:-2.ADD_COLUMN(VC:-2.4,SUM,“volume”,“ALL”)
VC:-2.6 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:4))
A:-9 = T:-2.ADD_COLUMN(VC:-2.6,MIN,“null”,“ALL”)
VC:-2.7 = CREATE_VC(T:-2,EXPR(“round”))
A:-10 = T:-2.ADD_COLUMN(VC:-2.7,DELAYED,“low”,“ALL”)
A:-11 = T:-2.ADD_COLUMN(VC:-2.6,MAX,“null”,“ALL”)
VC:-2.8 = CREATE_VC(T:-2,EXPR(“round”))
A:-12 = T:-2.ADD_COLUMN(VC:-2.8,DELAYED,“high”,“ALL”)
VC:-2.9 = CREATE_VC(T:-2,EXPR(“20100623”))
F:0 = CREATE_FILTER(T:-2,VC:-2.1,=,VC:-2.9,<null>,WHERE)
VC:-2.10 = CREATE_VC(T:-2,EXPR(“0”))
F:0 = AND(F:0,VC:-2.4,>,VC:-2.10,<null>)
VC:-2.11 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:2))
VC:-2.12 = CREATE_VC(T:-2,EXPR(“09:30:00”))
F:0 = AND(F:0,VC:-2.11,>=,VC:-2.12,<null>)
VC:-2.13 = CREATE_VC(T:-2,EXPR(“16:00:00”))
F:0 = AND(F:0,VC:-2.11,<,VC:-2.13,<null>)
T:-2.FILTER(F:0)
T:-3 = TMP_TABLE(T:-2)
VC:-3.0 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-1))
A:-1 = T:-3.ADD_COLUMN(VC:-3.0,LIST,“symbol”,“ALL”)
VC:-3.1 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-2))
A:-2 = T:-3.ADD_COLUMN(VC:-3.1,LIST,“date”,“ALL”)
VC:-3.2 = CREATE_VC(T:-3,EXPR(“sec_to_time”))
A:-3 = T:-3.ADD_COLUMN(VC:-3.2,LIST,“time”,“ALL”)
VC:-3.3 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-4))
A:-4 = T:-3.ADD_COLUMN(VC:-3.3,LIST,“trades”,“ALL”)
VC:-3.4 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-7))
A:-5 = T:-3.ADD_COLUMN(VC:-3.4,LIST,“vwap”,“ALL”)
VC:-3.5 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-8))
A:-6 = T:-3.ADD_COLUMN(VC:-3.5,LIST,“volume”,“ALL”)
VC:-3.6 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-10))
A:-7 = T:-3.ADD_COLUMN(VC:-3.6,LIST,“low”,“ALL”)
VC:-3.7 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-12))
A:-8 = T:-3.ADD_COLUMN(VC:-3.7,LIST,“high”,“ALL”)
T:-3.ADD_ORDER(VC:-3.1,ASC)
T:-3.ADD_ORDER(VC:-3.0,ASC)
T:-3.ADD_ORDER(VC:-3.2,ASC)
T:-3.MODE(LIMIT,0,10000)
RESULT(T:-3)

2010-07-02 14:39:16 Type conversion for VC:9
2010-07-02 14:39:16 Type conversion for VC:12
2010-07-02 14:39:16 Type conversion for VC:13
2010-07-02 14:39:16 [3] Merging conditions…
2010-07-02 14:39:16 [3] Initial execution plan (non-join):
2010-07-02 14:39:16 [3] Cnd(0):  VC:11(t0a2) BET. VC:21 AND VC:24     (16.96)
2010-07-02 14:39:16 [3] Cnd(1):  VC:4(t0a5) BET. VC:19(1) AND VC:20(+inf)    (17.27)
2010-07-02 14:39:16 [3] Cnd(2):  VC:1(t0a1) BET. VC:17 AND VC:18     (17.27)
2010-07-02 14:39:16 [3] Packs/packrows after KN evaluation:
2010-07-02 14:39:16 [3] (t0) Pckrows: 487, susp. 437 (0 empty 50 full). Packs opened in 3 cond.: 512
2010-07-02 14:39:19 [3] Packrows after exact evaluation (WHERE):
2010-07-02 14:39:19 [3] (t0): 487 all packrows, 487 to open (including 50 full)
2010-07-02 14:39:20 [3] Aggregating: 31489951 tuples left, 0 gr. found so far
2010-07-02 14:40:46 [3] Aggregating: 9347374 tuples left, 888205 gr. found so far
2010-07-02 14:41:25 [3] Aggregating: 45 tuples left, 1201188 gr. found so far
2010-07-02 14:41:25 [3] Aggregated (1201191 gr). Omitted packrows: 124 + 706 partially, out of 830 total.
2010-07-02 14:41:35 [3] Sorter initialized for 1201191 rows, 34+40 bytes each.
2010-07-02 14:41:39 [3] Sorted rows retrieved.
2010-07-02 14:41:39 [3] Displaying result: 10000 rows.
2010-07-02 14:41:40 [3] Total data packs actually loaded (approx.): 3963
2010-07-02 14:41:40 [3]——————————————————————————————————————

Execution log for the table with unsorted data set,

2010-07-02 14:34:37 [2] T:-1 = TABLE_ALIAS(T:0,“trades”)
T:-2 = TMP_TABLE(T:-1)
VC:-2.0 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:0))
A:-1 = T:-2.ADD_COLUMN(VC:-2.0,GROUP_BY,“symbol”,“ALL”)
VC:-2.1 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:1))
A:-2 = T:-2.ADD_COLUMN(VC:-2.1,GROUP_BY,“date”,“ALL”)
VC:-2.2 = CREATE_VC(T:-2,EXPR(“floor”))
A:-3 = T:-2.ADD_COLUMN(VC:-2.2,GROUP_BY,“bucket”,“ALL”)
A:-4 = T:-2.ADD_COLUMN(<null>,COUNT,“trades”,“ALL”)
VC:-2.3 = CREATE_VC(T:-2,EXPR(”*”))
A:-5 = T:-2.ADD_COLUMN(VC:-2.3,SUM,“null”,“ALL”)
VC:-2.4 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:5))
A:-6 = T:-2.ADD_COLUMN(VC:-2.4,SUM,“null”,“ALL”)
VC:-2.5 = CREATE_VC(T:-2,EXPR(“round”))
A:-7 = T:-2.ADD_COLUMN(VC:-2.5,DELAYED,“vwap”,“ALL”)
A:-8 = T:-2.ADD_COLUMN(VC:-2.4,SUM,“volume”,“ALL”)
VC:-2.6 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:4))
A:-9 = T:-2.ADD_COLUMN(VC:-2.6,MIN,“null”,“ALL”)
VC:-2.7 = CREATE_VC(T:-2,EXPR(“round”))
A:-10 = T:-2.ADD_COLUMN(VC:-2.7,DELAYED,“low”,“ALL”)
A:-11 = T:-2.ADD_COLUMN(VC:-2.6,MAX,“null”,“ALL”)
VC:-2.8 = CREATE_VC(T:-2,EXPR(“round”))
A:-12 = T:-2.ADD_COLUMN(VC:-2.8,DELAYED,“high”,“ALL”)
VC:-2.9 = CREATE_VC(T:-2,EXPR(“20100623”))
F:0 = CREATE_FILTER(T:-2,VC:-2.1,=,VC:-2.9,<null>,WHERE)
VC:-2.10 = CREATE_VC(T:-2,EXPR(“0”))
F:0 = AND(F:0,VC:-2.4,>,VC:-2.10,<null>)
VC:-2.11 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:2))
VC:-2.12 = CREATE_VC(T:-2,EXPR(“09:30:00”))
F:0 = AND(F:0,VC:-2.11,>=,VC:-2.12,<null>)
VC:-2.13 = CREATE_VC(T:-2,EXPR(“16:00:00”))
F:0 = AND(F:0,VC:-2.11,<,VC:-2.13,<null>)
T:-2.FILTER(F:0)
T:-3 = TMP_TABLE(T:-2)
VC:-3.0 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-1))
A:-1 = T:-3.ADD_COLUMN(VC:-3.0,LIST,“symbol”,“ALL”)
VC:-3.1 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-2))
A:-2 = T:-3.ADD_COLUMN(VC:-3.1,LIST,“date”,“ALL”)
VC:-3.2 = CREATE_VC(T:-3,EXPR(“sec_to_time”))
A:-3 = T:-3.ADD_COLUMN(VC:-3.2,LIST,“time”,“ALL”)
VC:-3.3 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-4))
A:-4 = T:-3.ADD_COLUMN(VC:-3.3,LIST,“trades”,“ALL”)
VC:-3.4 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-7))
A:-5 = T:-3.ADD_COLUMN(VC:-3.4,LIST,“vwap”,“ALL”)
VC:-3.5 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-8))
A:-6 = T:-3.ADD_COLUMN(VC:-3.5,LIST,“volume”,“ALL”)
VC:-3.6 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-10))
A:-7 = T:-3.ADD_COLUMN(VC:-3.6,LIST,“low”,“ALL”)
VC:-3.7 = CREATE_VC(T:-3,PHYS_COL(T:-2,A:-12))
A:-8 = T:-3.ADD_COLUMN(VC:-3.7,LIST,“high”,“ALL”)
T:-3.ADD_ORDER(VC:-3.1,ASC)
T:-3.ADD_ORDER(VC:-3.0,ASC)
T:-3.ADD_ORDER(VC:-3.2,ASC)
T:-3.MODE(LIMIT,0,10000)
RESULT(T:-3)

2010-07-02 14:34:37 Type conversion for VC:9
2010-07-02 14:34:37 Type conversion for VC:12
2010-07-02 14:34:37 Type conversion for VC:13
2010-07-02 14:34:43 [2] Merging conditions…
2010-07-02 14:34:43 [2] Initial execution plan (non-join):
2010-07-02 14:34:43 [2] Cnd(0):  VC:1(t0a1) BET. VC:17 AND VC:18     (17.27)
2010-07-02 14:34:43 [2] Cnd(1):  VC:11(t0a2) BET. VC:21 AND VC:24     (22.11)
2010-07-02 14:34:43 [2] Cnd(2):  VC:4(t0a5) BET. VC:19(1) AND VC:20(+inf)    (22.57)
2010-07-02 14:34:43 [2] Packs/packrows after KN evaluation:
2010-07-02 14:34:43 [2] (t0) Pckrows: 97174, susp. 488 (96686 empty 0 full). Packs opened in 3 cond.: 978
2010-07-02 14:34:52 [2] Packrows after exact evaluation (WHERE):
2010-07-02 14:34:52 [2] (t0): 97174 all packrows, 488 to open (including 0 full)
2010-07-02 14:34:55 [2] Aggregating: 31489951 tuples left, 0 gr. found so far
2010-07-02 14:35:50 [2] Aggregating: 20122610 tuples left, 888205 gr. found so far
2010-07-02 14:36:48 [2] Aggregating: 9652580 tuples left, 1776410 gr. found so far
2010-07-02 14:37:42 [2] Aggregating: 81090 tuples left, 2650532 gr. found so far
2010-07-02 14:37:50 [2] Aggregated (2699417 gr). Omitted packrows: 458 + 948 partially, out of 1409 total.
2010-07-02 14:38:12 [2] Sorter initialized for 2699417 rows, 34+40 bytes each.
2010-07-02 14:38:21 [2] Sorted rows retrieved.
2010-07-02 14:38:21 [2] Displaying result: 10000 rows.
2010-07-02 14:38:22 [2] Total data packs actually loaded (approx.): 5574
2010-07-02 14:38:22 [2]——————————————————————————————————————

===================================================================================

Its the same query being run both tables but giving incorrect result with unsorted data set.
Am I doing something wrong or am I hit a potential bug.

Kindly review the logs and let me know your inputs.

Thanks,
Manu

Profile
 
Posted: 03 July 2010 08:08 AM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

Some things to make sure about:
- Are there the same data? One of the query is on 31-mln row table, another is on 6-bln row table (including delete rows, if any). Please try to create two identical tables and load data (sorted and not sorted) to both.
- Please try the latest version of Infobright, maybe the problem is already solved.
- Please check whether it is not a matter of different collations / charsets. Grouping may be also case insensitive.

If the problem persists, the best way to solve it is to send us the data and query for tests, if it is possible.

Regards,

Signature 
Profile
 
Posted: 06 July 2010 08:46 AM   Ignore ]   [ # 3 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

HI Jakub,

Its the same file which is loaded into both the tables. The table with 31 million rows has the data for only one day and the one with 6 billion rows has data for around 200 days. The 31 million row is the temp table, which I created just to test the sorted/unsorted thing. Even in the 6-billion row table, the data loaded from unsorted file, gives the wrong result and the one loaded from sorted file, gives the correct result.

So, its the same data and the infobright version I am using is,

Server version: 5.1.40-log build number (revision)=IB_3.3.1_r6997_7017(ice)

Even if the same data is loaded into the same table (with only difference of sorted and unsorted data), it gives the same error. So it shouldn’t be a matter of different collations/charsets.

It looks that this is something to do with the memory, like how infobright is keeping the previously sorted rows in memory and and after what limit it flushes the data out of memory, before putting new rows in memory.

Please let me know your inputs.

Thanks,
Manu

Profile
 
Posted: 13 July 2010 04:20 PM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

Hi All,

I am still waiting for your response. I have temporarily solved the problem by using sorted data file(to load data it into the table), but needs a permanent solution.

Kindly let me know your inputs as this will decide our commitment with the product for future.

Thanks,
Manu

Profile
 
Posted: 14 July 2010 09:15 AM   Ignore ]   [ # 5 ]  
Administrator
Avatar
Rank
Total Posts:  38
Joined  2009-01-19

Hi Manu,

We have been looking at this; it has been difficult to reproduce without schema and data.

We have made some fixes based on what we could glean from your post, and those will be available in our upcoming 3.4.2 release. I would suggest the following:

1) Try the 3.4.2 release when it comes out
2) If possible, could you get us a sample of your data and schema so that we could attempt to get a reproducible case?

Let us know.

thanks
.g.

Profile
 
Posted: 14 July 2010 02:44 PM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

Hi Graham,

We will give it a try with 3.4.2, but just want to understand the cause of this error.

Regarding schema, you can create two table with around 9 columns. Load same data in each of the table. Load unsorted data in one file and sorted data in another table. Put large amount of data, around 30 million rows. Also, note that the first column is of string type, other 8 columns may be mix (int,char,float,double,etc.). Sort one of the sample file from which data has to be loaded on the basis of first column and keep another copy unsorted.

Load these files into two different tables.
Run groupby queries on the two with groupby on first column (string).

The table with sorted data gives the correct result while the unsorted gives incorrect result.

Unfortunately, I couldn’t share the data samples.

Hope this should help in reproducing the issue.
Kindly let me know your inputs/observations.

Thanks,
Manu

Profile
 
Posted: 15 July 2010 02:07 PM   Ignore ]   [ # 7 ]  
Member
RankRankRank
Total Posts:  106
Joined  2008-08-18

Hi,

Can you provide further detail information about:

* How wide is the string column,e.g. varchar(???)
* The max/min width of string column?
* count(distinct()) of string column?
* Does this string column contain UTF8 strings?
* Does this string column contain binary strings?
* What’s your “Collation_Server” setting in my-ib.cnf?
* What’s your DDL charset/collation definition (show create table xxx)?
* What aggregation function do you use when group by string column: max/min, count, sum, avg ... ?

I will try to generate some random data in our environment.

Thanks

Profile
 
Posted: 15 July 2010 04:57 PM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

DROP TABLE IF EXISTS `tickstore`.`trades_temp`;
CREATE TABLE trades_temp` (
  `symbol` char(10) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,
  `date` date NOT NULL,
  `time` time NOT NULL,
  `msecs` smallint(5) NOT NULL,
  `price` float NOT NULL,
  `size` mediumint(8) NOT NULL,
  `source` char(2) NOT NULL,
  `code` char(1) NOT NULL,
  `str1` float NOT NULL,
  `str2` float NOT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1;

Profile
 
Posted: 15 July 2010 05:00 PM   Ignore ]   [ # 9 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

“select symbol, date, sec_to_time((bucket+1)*60) as time, trd, val, volume, low, high from
( SELECT symbol,date,floor((time_to_sec(time)*1000+msecs-1)/60) as bucket, count(*) as trd, round(sum(price*size)/sum(size),6) as val, sum(size) as volume, round(min(price),4) as low, round(max(price),4) as high
  FROM trades where date = ? and size>0 and time>=‘09:30:00’ and time<‘16:00:00’ group by symbol, date, bucket
) a order by date, symbol, time;”;

Profile
 
Posted: 16 July 2010 10:49 AM   Ignore ]   [ # 10 ]  
Member
RankRankRank
Total Posts:  106
Joined  2008-08-18

Hi,

I notice `symbol` is using latin1_general_cs.

Can you re-create table (drop and create), using

create table xxx (
...
) engine=brighthouse charset=latin1 collate latin1_bin;

And run select statement again on both sorted, and non-sorted data?

Thanks

Profile
 
Posted: 16 July 2010 10:51 AM   Ignore ]   [ # 11 ]  
Member
RankRankRank
Total Posts:  106
Joined  2008-08-18

I guess the table is stock quotation, so the `symbol` could only consists of [A-Z], [0-9], -, _.

I will create my random data in this way.

Profile
 
Posted: 16 July 2010 11:25 AM   Ignore ]   [ # 12 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

The symbol only consisis of [A-Z] and ‘/’

Profile
 
Posted: 16 July 2010 11:27 AM   Ignore ]   [ # 13 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

In our table, I need latin1_general_cs, because the symbols are case sensitive.Actually, symbol consists of [A-Z],[a-z] and ‘/’.

Profile
 
Posted: 19 July 2010 08:22 AM   Ignore ]   [ # 14 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  916
Joined  2008-08-18

Hi!

latin1_bin is case sensitive and should work faster than latin1_general_cs.

[ Edited: 19 July 2010 10:51 AM by Janusz Borkowski]
Profile
 
Posted: 19 July 2010 08:38 AM   Ignore ]   [ # 15 ]  
Newbie
Rank
Total Posts:  24
Joined  2010-06-01

Thanks, but even with that its giving the incorrect result.

Profile
 
   
1 of 2
1