I got one table with 16000 rows and one with 40 000 000. ICE 3.4.2.
In order to utilize our 16 cores (32gb mem) I’m thinking of if there are any gains in splitting the job over cores.
select be.id, be.segment, be.timeindex, td.somecol
from (select id, segment, timeindex from events limit 7501,2500) be
, data td
where be.id = td.id
and td.timeindex between be.timeindex-4000 and be.timeindex
and td.somecol > 0
group by be.segment
order by id, td.timeindex;
Since the events table contains 16000 post I can split these into chunks of 2500 rows.
The timeindex in this case means 4 sec and each join will return 40 rows (in total 16000*40).
Running the query without limit clause took 739 sec. When starting 7 simultaneous processes with 2500 rows from events I reduced the total time to 467 sec (some finishing in 167 sec and the last to finish in 467 sec). Although the execution time was only 63% of the original query I wonder if someone has any ideas on how to optimize this?
And also to hear your best trick to fully use multicore environments?
Infobright Enterprise Edition does support multi-core, but ICE does not.
You may also want to try to remove the join between these two tables. One one extreme, you can try to condense the dataset to potentially one table (the compression should help alleviate space needs).
First of all, I would avoid using a subquery. Joining with a subquery result means unnecessary materialization of data, and disabling most of Knowledge Grid optimizations. If the ‘events’ table consists of any kind of key column, please use it in WHERE part instead of LIMIT. Suppose that be.timeindex is such a column:
select be.id, be.segment, be.timeindex, td.somecol from events be, data td where be.id = td.id and td.timeindex between be.timeindex-4000 and be.timeindex and td.somecol > 0 AND be.timeindex BETWEEN [some values defining a 2500-row fragment] AND td.timeindex BETWEEN [similar condition, implication of the above] group by be.segment order by id, td.timeindex;
Query Execution Log is also a good tool for identifying bottlenecks.
So I rewrote the query, still determined to spread the query over cores
select be.id, be.segment, be.timeindex, td.somecol
from events be
, data td where be.id < 1000
and be.id = td.id
and td.timeindex between be.timeindex-4000 and be.timeindex
and td.somecol > 0
group by be.segment
order by id, td.timeindex;
And then replacing highlighted into two other querysessions with
* between 1000 and 2000
* > 2000
The first query as well as the third was done in 290 sec. Using between took 591 (looking in bh.err it seems like a traditional full table scan on the data table)!
But by replacing “be.id between 1000 and 2000” with
“be.id >=1000 and be.id <= 2000”
the query was as fast as the other ones running in 291 sec.
I thought that “between” was the problem and replaced also the between on timeindex but that made no difference, still 291 sec.
But to summarize the query of 739 sec now elapses in 291 sec using three cores
BETWEEN and the pair of inequalities should be executed in exactly same way, and I would expect that caching mechanism is responsible for any differences in time.
Jakub: I ‘m interested in what you mean about caching. My understanding is that there are no caching in the architecture? If I run one query a number of times it will always take the same number of seconds?
Jakub means that the most lately queried data packs remain in memory until you restart the server or until memory needs to be freed in order to put there other data packs.
Therefore, an important question is whether you restarted the server before re-running your queries with different “between” and “inequality and inequality” options. If you didn’t, then you might see that the same (or similar) query suddenly runs faster. But if you did, then the difference in times between “between” and “inequality and inequality” options would be truly mysterious.
By the way, thanks a lot for answering my question about different inequalities. All these cases are significantly faster than the result that you originally reported for “between”. Thus, the first guess is that some data packs are already in memory. The best way to proceed is therefore to check all “between” and “inequality and inequality” options one more time, but making sure that the server is restarted after each single query.
running three parallel queries:
restart
started first: “be.id <1000 ” -> 278 sec
started second: “be.id >= 1000 and be.id <= 2000” -> 357 sec
started third: “be.id > 2000” -> 356 sec
I cannot get a grip of the caching, we have 36 gb memory, Server Main Heap Size set to 24000. When running these three queries I get about 2% used memory. Caching seems not to play a big difference in these cases (if I have not done anything wrong in the setup).
To make things more complicated - but more realistic - there are 2 cache mechanisms involved.
1. Operating system reads needed files and stores their content in memory. Subsequent reads targeting same files do not need to get data from disk, they get it from system cache. For some queries this makes big difference - eg. if reading from disk is a bottleneck, so CPUs are working at 50%, with data in system cache disk is no more a bottleneck and CPU can work at 100%
2. Data read from disk by IB is compressed. It gets decompressed and stored in application-level cache. A query needing data already present in this cache does not need to use disk and decompression at all - data is ready to be used.
Both caches are managed independently - the first one by OS, the second one by IB. To clear the first one in Linux either restart system or do
echo 3 > /proc/sys/vm/drop_caches (needs root). To clear the second cache restart IB.