Joinutility seperatorLogin utility separator Infobright.com
   
 
scaling
Posted: 06 February 2009 08:30 PM   Ignore ]  
Newbie
Rank
Total Posts:  15
Joined  2009-01-29

I have a question about scaling.  On the front page of infobright.com there is a mention of the word scaling, but does infobright scale to more than one server?  Can queries be executed across multiple machines?

Ankur

Profile
 
Posted: 11 February 2009 08:25 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi Ankur,

Scalability in the case of Infobright means ability of maintaining stable performance pattern when database grow.
We are trying e.g. to maintain the same load speed for the first and the 10-th terabyte of input data, including all the overhead connected with Knowledge Grid etc. And our QA routine tests confirm this ability.
Another dimension of scalability is to maintain speed of queries which regard only a part of data, at the same level regardless how much rows we’re going to add in the future. E.g. if we are calculating statistics for a given month, it should take the same time regardless we have 1 year or 10 years of data in our fact table.
It is also about avoiding rapid performance loss when some hardware limitations are overpassed. Some degradation will always occur when e.g. a query uses more data than available RAM, but the goal is to keep it reasonable.

Infobright currently have no ability to be distributed into many machines, although it is planned in the future - it will be a scalability oriented on a number of concurrent users.

Regards,

Signature 
Profile
 
Posted: 22 March 2009 01:36 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  1
Joined  2009-03-22

Regarding scaling, after loading up a days worth of data and getting very positive results, I tried loading up some more (15 days) which is around 15 million rows. With this amount of data the query speed takes a big hit. The server is a dual Opteron 2212 (1.8 GHz)v with 4 GB of RAM running Centos 5.2 x64.

The table schema for the fact table is:

CREATE TABLE `page_impressions` (
  `service_id` int(10) DEFAULT NULL,
  `uid` char(32) DEFAULT NULL,
  `date_key` int(10) DEFAULT NULL,
  `date` int(10) DEFAULT NULL,
  `hour` tinyint(2) DEFAULT NULL,
  `page` char(100) DEFAULT NULL,
  `portion` int(10) DEFAULT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=utf8

One query that is very important for us is:

SELECT page, COUNT(DISTINCT(uid)) AS count FROM page_impressions WHERE service_id = 49 AND date_key BETWEEN 2009020100 AND 2009020123 GROUP BY page ORDER BY count DESC;

With a days worth of data loaded this takes under 5 sec. to run. With 15 million rows it takes between one and two minutes. Now I realise 4 GB of RAM is less than the recommended amount for production servers, but what happens when the rows reach 100 million or 200 million? Is there a rule to calculate how much RAM will be needed as the size of the table grows?

Also I know that CHAR columns are not optimal, but I have over 300k distinct values in the page column so LOOKUP would probably destroy performance.

Profile
 
Posted: 22 March 2009 01:53 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi Einarb,

Thanks for trying it on full scale.
The problem with COUNT(DISTINCT) operations is that it is inevitably nonlinear, as well as e.g. sorting. The grouping on 10 days date interval will be more than 10 times slower than the same query for one day. In your case I think it is also an effect of swapping caused by low memory avaliable - you may try to find it out in “bh.err” logs. More than one pass through data is needed if the COUNT(DISTINCT) algorithm is out of memory.

I hope you are loading data more or less chronologically. In this case at least operation within one particular day should not be slower when 100 more days arrive. (Otherwise please send us “bh.err” log to see where the problem might be.)

Regards,

Signature 
Profile
 
Posted: 23 March 2009 05:38 AM   Ignore ]   [ # 4 ]  
Super Duper Member
Avatar
RankRankRankRankRank
Total Posts:  916
Joined  2008-08-18

Hi !

I would only add that memory settings in brighthouse.ini may play an important role. For example if you have MainHeapSize=3000 and CompressedHeapSize=300, then ICE takes 3300MB immediately, leaving only 700MB for the system and other programs. This is fine as far as query processing can fit within the reserved 3300MB. For larger data sets (and for multiuser scenarios) the processing can require more memory. Here we have two situations:
a) the algorithm can notice that it has limited memory and does some multipass processing (as Jakub mentioned) - this is a significant slow down, but not so bad as the swapping is controlled by the algorithm
b) An algorithm needs a memory buffer but all the reserved 3300MB are already in use, so an additional buffer must be claimed. ICE starts to use more memory than it is installed in the machine, causing system controlled swapping. This is very bad for the performance.

The b) scenario can be detected by using e.g. ‘top’ or ‘vmstat’ tools. To prevent system controlled swapping, decrease declared MainHeapSize and CompressedHeapSize (server restart is necessary).

I confirm that for 300k distinct values LOOKUP is not an option at all.

Hope it helps..

Profile
 
Posted: 27 March 2009 01:12 AM   Ignore ]   [ # 5 ]  
Newbie
Rank
Total Posts:  42
Joined  2009-03-26

Hi Jakub,

May I have the same problem in this topic.
My server is a Xeon 4 core with 16GB of RAM running Centos 5 x64bit.
And my table design is:

CREATE TABLE `logtemp` (
  `Timecreate` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT ‘Size[MB]: 13.0; Ratio: 537.25’,
  `browser_code` TINYINT(4) DEFAULT NULL COMMENT ‘Size[MB]: 49.5; Ratio: 8.36’,
  `browser_name` VARCHAR(50) COLLATE utf8_bin DEFAULT NULL COMMENT ‘Size[MB]: 480.2; Ratio: 3.79’,
  `browser_ver` VARCHAR(20) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 939.4; Ratio: 1.49’,
  `os_name` VARCHAR(50) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 105.3; Ratio: 24.84’,
  `os_code` TINYINT(4) DEFAULT NULL COMMENT ‘Size[MB]: 4.7; Ratio: 87.12’,
  `os_ver` VARCHAR(20) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 328.1; Ratio: 3.51’,
  `bannerid` VARCHAR(100) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 49.7; Ratio: 60.13’,
  `ip` INTEGER(11) NOT NULL COMMENT ‘Size[MB]: 783.4; Ratio: 1.88’,
  `domain` VARCHAR(150) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 1383.2; Ratio: 3.32’,
  `url-path` VARCHAR(255) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 12020.1; Ratio: 1.30’,
  `clickorview` TINYINT(4) DEFAULT NULL COMMENT ‘Size[MB]: 0.5; Ratio: 704.55’,
  `cookie_create` TIMESTAMP NOT NULL DEFAULT ‘0000-00-00 00:00:00’ COMMENT ‘Size[MB]: 870.1; Ratio: 8.03’,
  `guid` VARCHAR(50) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 13079.1; Ratio: 0.70’,
  `day` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 0.2; Ratio: 999.99’,
  `month` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 0.2; Ratio: 999.99’,
  `year` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 0.2; Ratio: 999.99’,
  `hour` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 0.3; Ratio: 999.99’,
  `minute` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 1.0; Ratio: 999.99’,
  `is_new` INTEGER(11) DEFAULT NULL COMMENT ‘Size[MB]: 18.1; Ratio: 83.45’,
  `geographical` VARCHAR(200) COLLATE utf8_general_ci DEFAULT NULL COMMENT ‘Size[MB]: 2.3; Ratio: 179.01’
)ENGINE=MyISAM
ROW_FORMAT=COMPRESSED CHARACTER SET ‘utf8’ COLLATE ‘utf8_general_ci’
COMMENT=‘Overall compression ratio: 2.228’;
In fact, we use IB for analytic system and the logtemp table may growth about 20 millions of rows per day.

To calculate the totalview in one day we use this query:

select count(guid
from `allvc`
WHERE clickorview=0
AND (bannerid '12345678')
AND (
timecreate between '2009-03-20 00:00:00' and '2009-03-20 23:59:59')
ORDER BY NULL 

Result: 20 586 598
1 rows fetched (4.906 sec)

To calculate totalview in 4 days (from 20th -> 24th)

select count(guid
from `allvc`
WHERE clickorview=0
AND (bannerid '12345678')
AND (
timecreate between '2009-03-20 00:00:00' and '2009-03-24 23:59:59')
ORDER BY NULL 

Result: 95 731 121
1 rows fetched (37.218 sec)

total Unique view in one day (total visitor)

select count(DISTINCT guid
from `allvc`
WHERE clickorview=0
AND (bannerid '12345678')
AND (
timecreate between '2009-03-20 00:00:00' and '2009-03-20 23:59:59')
ORDER BY NULL 

Result: 1 588 993
1 rows fetched (00:01:57)

But when I try to calculate total visitor in 4 days the time return too long, some time the query making IB engine down. So, Can you give me any ideas for this problem? What can I have to do for increase performance?

Regard,

Thanh Nguyen.

[ Edited: 27 March 2009 01:57 AM by Thanh Nguyen]
Signature 

Thanh Nguyen

skype: thanhntvt
Website: Admicro.vn

Profile
 
Posted: 27 March 2009 03:22 AM   Ignore ]   [ # 6 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi Thanh,

Counting distinct varchar(50) values on 95 mln rows is never an easy task…
I would ensure that the proper memory settings are set in brighthouse.ini. In your case MainHeapSize=12000 should be efficient (with system swapping minimized). Another idea would be to try the following:

select clickorviewcount(DISTINCT guid
from `allvc`
WHERE clickorview=0
AND (bannerid '12345678')
AND (
timecreate between '2009-03-20 00:00:00' and '2009-03-20 23:59:59')
GROUP BY clickorview

May look weird, but count(distinct) inside group by in some cases works better than without it. In a new versions of ICE this inefficiency will be fixed. And the result will be the same as we have just one value of clickorview. Please let us know if it helped.

Regards,

Signature 
Profile
 
Posted: 27 March 2009 05:49 AM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  42
Joined  2009-03-26

Hi Jakub,

Thank for your quick reply. I tried to use “group by clickorview” and the result so great if the timecreate range in a day (‘2009-03-20 00:00:00’ and ‘2009-03-20 23:59:59’).
Here the result:
1 588 993
1 rows fetched (23.375 sec)  (faster than (00:01:57) when query with out “group by clickorview” that I posted above).

But when I tried to do this one with timecreate range in 4 days (‘2009-03-20 00:00:00’ and ‘2009-03-24 23:59:59’) the time return too long again (about more than 10 minutes).

Last, I tried to map from varchar(50) to numeric, it should be the best way to change right now, isn’t it?

PS: I changed the MainHeapSize (in fact it is the ServerMainHeapSize?) = 12000.

Signature 

Thanh Nguyen

skype: thanhntvt
Website: Admicro.vn

Profile
 
Posted: 27 March 2009 08:10 AM   Ignore ]   [ # 8 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  736
Joined  2008-08-18

Hi,

Changing to numerics, if possible, is always a good idea in this case.

Regards,

Signature 
Profile
 
Posted: 03 July 2009 01:59 AM   Ignore ]   [ # 9 ]  
Newbie
Rank
Total Posts:  20
Joined  2009-06-19

Hi,

  I have 2 table with 60000000 rows. one tables is infobright table another table is myisam table. 2 tables have same values.

now i have retrive the data from the both table…. i wondered… infobright table take more time than myisam table… why?

i have been using the following query only.

select id,count(code) from tbl_fact group by id limit 10;

how to improve the query speed in infobright?

How to set the optimized server variable.?

I have 4GB of RAM in that machine. what is the optimized server vaiables.?

can u any one help me…?


Thanks.
S.Ashokkumar

Profile
 
Posted: 08 July 2009 11:10 AM   Ignore ]   [ # 10 ]  
Member
Avatar
RankRankRank
Total Posts:  191
Joined  2008-08-18

Hi,

Try to compare speed of the query without LIMIT. The problem is that limit has not been yet well optimized for group by queries. Do you use limit for comparison only or you need it in your production environment?

Thanks,
Piotr

Profile