Joinutility seperatorLogin utility separator Infobright.com
   
 
logging measured values
Posted: 06 May 2009 12:39 PM   Ignore ]  
Newbie
Rank
Total Posts:  7
Joined  2009-05-06

Hi,

my question is similar to a previous topic, but its from october and not discussed anymore, so i created a new one…

I’m new to ICE and data warehouses and I’m not sure if ICE is the right choice. Actually I have a sql db where I’m logging different datapoints with one or more measured values from some plc’s. Every 10-20 seconds I’m logging about 100-1000 new datapoints. On every insert to the fact table, I’m aggregating the data (avg, min, max, sum) in an 5min, 1hour and a daily archive. The queries are like

'select sum(counter1) from daily_archive where datapointid = 1 and year = 2009 and month = 2' 

to get the total amount over one month
or

'select avg_val1, min_val1, max_val1 from 5minarchive where datapointid = 1 and ts > '2009-02-11 10:00:00' and ts < '2009-03-11 10:00:00 

to create a linechart with the specified values.

Can I use ICE for such a scenario?

Is it possible to tell ICE for example, that I need the data with a resolution of 10-15 seconds only a few weeks, the hourly aggregated data a few years?

I read that there are no INSERT’s, currently I have a c++ app, which collects the data from the different plc’s and makes an INSERT. With ICE I should log this data to a csv-file and then load it with the ICE data loader or am I wrong? Can I call the loader every 5 minutes?

would the tabledesign be something similar to the following and the aggregation is done in the queries:

datapoint_id
timestamp
dim_year
dim_month
dim_week
dim_dayofweek
...
counter1
counter2
...
value1
value2
... 

(every datapoint can have one or more counters and one or more analog values)

 

Thank you for your help!

Regards,
Willi

Profile
 
Posted: 06 May 2009 03:29 PM   Ignore ]   [ # 1 ]  
Jr. Member
Avatar
RankRank
Total Posts:  96
Joined  2008-08-18

I assume PLC means Programmable Logic Controller.  From what I understand with PLCs, is that they give a bunch of bits on the line.  Yes, you could store these ones and zeroes in a row.  Yes, you could load thousands of rows at a time.

You could create columns to mark each row with year, month, week, time(stamp), etc.

Yes, with ICE, you would have to load from a text file (e.g. .csv) and then query.

Yes you could load every 5 minutes, but experiment to make sure that the loader is finished doing the previous load.  It depends on how much you’re loading at a time.  The ICE loader should load a thousand rows in a matter of seconds.  The IEE loader is almost two times faster.

You’re concept/idea for the table definition looks correct.

If each row/datapoint in the table can have at least one counter or analog value without a maximum number of counters or analog values, then you may want to consider having one column in the table reference/link a secondary table where you would store the counter or analog values.

You may also want to consider loading from a pipe:
http://www.infobright.org/Forums/viewthread/153/

Signature 

Brian Beharry, QA
Infobright

Profile
 
Posted: 06 May 2009 04:20 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  7
Joined  2009-05-06

Hi Brian,

thank you for your fast reply!

Yes PLC means Programmable Logic Controller…

Maybe I should rename the ‘datapoint’ to ‘datastruct’ and a possible datastruct would be an electr. engine with a counter for the consumed energy, an analog value for the total power and three analog values for the voltage of each phase. The maximum number of counters and analog values is limited so I can use only one table.

Unfortunately I’m not using linux, so I have no pipes…

What about deleting old rows or aggregate them to daily values, otherwise the table is growing and growing? With 1000 datastructs and 1 row for each datastruct every 15 seconds I will get 2billion rows per year.

Regards,
Willi

Profile
 
Posted: 06 May 2009 04:41 PM   Ignore ]   [ # 3 ]  
Jr. Member
Avatar
RankRank
Total Posts:  96
Joined  2008-08-18
hafwil - 06 May 2009 04:20 PM

Hi Brian,

thank you for your fast reply!

.
. .

Unfortunately I’m not using linux, so I have no pipes…

What about deleting old rows or aggregate them to daily values, otherwise the table is growing and growing? With 1000 datastructs and 1 row for each datastruct every 15 seconds I will get 2billion rows per year.

Regards,
Willi

Regarding deletion of rows, you won’t be able to delete with ICE, but:
1) You can DROP TABLE and
2) I bet ICE can easily hold your 2 billion rows per year.  You only need enough RAM and disk space (ICE compresses for disk)

Regarding aggregation, at the end of your day, you could export the data as aggregates, drop the data for the day, then load the exported aggregates for the day.

Also, there’s a decent data generator here that can generate 2 billion rows for you for testing/experimentation:
http://www.datamelt.com/java-datamelt/process?action=bsh&scriptname=downloads

Signature 

Brian Beharry, QA
Infobright

Profile
 
Posted: 06 May 2009 04:58 PM   Ignore ]   [ # 4 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello Willi,

Let me second Brian’s opinion.

If you have enough disk space, there will be no problem with 2 billion rows. We’d be very interested in compression ratio that you can achieve. It should be pretty good. Also, it looks like your queries would be just moderately memory/CPU-consuming. Hence, it should be really fine.

Otherwise, as Brian says, you can use the drop-table-based strategy. Given tables with detailed data for every single day (within the period of last few weeks), you would be able to merge the required results somehow. However, storing the complete detailed data seems to be far cleaner solution, also enabling some extra opportunities for data analysis.

Many thanks and best greetings,

Dominik

Signature 
Profile
 
Posted: 06 May 2009 05:28 PM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi Willi,

Your use case looks like tailored especially for Infobright.
I wouldn’t worry neither about storage nor query speed for several billions of rows (however a 64-bit machine, 4 GB RAM or more, would be welcome in this case). As you’re going to load data chronologically, all queries with time-based conditions will work fast thanks to Knowledge Grid. As a small addition to what others already said: if your data are not integers, I would suggest to store data points as decimals (fixed precision) rather than floats. Floating point operations are less optimized in ICE.

Regards,

Signature 
Profile
 
Posted: 07 May 2009 03:39 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  7
Joined  2009-05-06

Hi,

I did now some first tests and I’m very surprised!

That’s my table:

CREATE TABLE `testtable` (
  `
idint NOT NULL,
  `
tstimestamp NOT NULL,
  `
dim_yearint NOT NULL,
  `
dim_quartaltinyint(4NOT NULL,
  `
dim_monthtinyint(4NOT NULL,
  `
dim_dayofweektinyint(4NOT NULL,
  `
dim_daytinyint(4NOT NULL,
  `
dim_hourtinyint(4NOT NULL,
  `
dim_quarterhourtinyint(4NOT NULL,
  `
dim_mintinyint(4NOT NULL,
  `
dim_sectinyint(4NOT NULL,
  `
counter1decimal(10,5NOT NULL,
  `
counter2decimal(10,5NOT NULL,
  `
counter3decimal(10,5NOT NULL,
  `
counter4decimal(10,5NOT NULL,
  `
val1decimal(10,5NOT NULL,
  `
val2decimal(10,5NOT NULL,
  `
val3decimal(10,5NOT NULL,
  `
val4decimal(10,5NOT NULL,
  `
val5decimal(10,5NOT NULL,
  `
val6decimal(10,5NOT NULL,
  `
val7decimal(10,5NOT NULL,
  `
val8decimal(10,5NOT NULL,
  `
val9decimal(10,5NOT NULL,
  `
val10decimal(10,5NOT NULL,
  `
val11decimal(10,5NOT NULL,
  `
val12decimal(10,5NOT NULL,
  `
val13decimal(10,5NOT NULL,
  `
val14decimal(10,5NOT NULL,
  `
val15decimal(10,5NOT NULL
ENGINE=BRIGHTHOUSE DEFAULT CHARSET=ascii COLLATE=ascii_bin

I tested only with 2million records on my notebook, but I just can say, that its a lot faster than the existing one. The table needs only 30MB on the disk, where the existing needs just 4-5 times more for the indexes only. The queries over the different dimensions are very fast! When I’ve done some tests on the real server (4GB RAM wink I will tell you the results.
Are there any other hints to my table structure, for example my selects depends always on the id, ts and/or the dimensions. Is this something I can/should tell ICE or its not necessary?
I can’t believe that creating a logtable can be so easy…

Regards,
Willi

Profile
 
Posted: 07 May 2009 04:23 AM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi Willi,

You don’t need to tell anything to ICE. But if your queries depend on id and ts, it is quite important to load data ordered by id / ts. The ordering may not be strict - it is fairly enough to have a sequence like this:

1, 2, 3, ...., 1000000, 3000000, 3000001, ... 4000000, 2000000, 2000001, 2999999 etc.,

i.e. a “partial order” is also good. A general rule is that data packs (which are 65536 rows long in ICE) should have possibly similar values on a key columns.

BTW, ‘id’ int NOT NULL may be not enough for billions of rows…

Regards,

Signature 
Profile
 
Posted: 07 May 2009 04:39 AM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  7
Joined  2009-05-06

Many Thanks Jakub!

The id is the identifier for the datastruct, so there will be max. 1000-2000 different ids.

Actually I loaded the data like:

id=0;ts=2009-01-01 00:00:00;....
id=1;ts=2009-01-01 00:00:00;....
....
id=1000;ts=2009-01-01 00:00:00;....
id=0;ts=2009-01-01 00:00:30;....
id=1;ts=2009-01-01 00:00:30;....
....
id=1000;ts=2009-01-01 00:00:30;.... 

and so on. I think I should try to get something like

id=0;ts=2009-01-01 00:00:00;....
id=0;ts=2009-01-01 00:00:30;....
id=0;ts=2009-01-01 00:01:00;....
....
id=1000;ts=2009-01-01 00:00:00;....
id=1000;ts=2009-01-01 00:00:30;....
id=1000;ts=2009-01-01 00:01:00;....
.... 


I’m eager to see how it works on the server! Thank you very much!

Profile
 
Posted: 07 May 2009 04:59 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  606
Joined  2008-08-18

Hi,

It’s worth trying both orderings - which is the better one depends on queries. A size of loaded files also matters: the bigger (ordered) inputs, the better performance. 100,000 rows is a reasonable minimum here.

Regards,

Signature 
Profile