I suggested to use ICE with the following schema:
CREATE TABLE genes1 (
patient_id int,
gene_id int,
gene_val char(3) comment 'lookup'
) engine=brighthouse;
I’ve prepared a script to generate and load artificial data. In a medium-scale experiment I’ve generated 10000 patients, which gives 306 GB of text input files. Generating these random data took 12.5 h, then loading 5 h.
Compression 1:50 (well, not a big achevement given that e.g. first 2 mln rows have the same patient_id), i.e. the database folder is 5.9 GB. Note that we have 20 bln random gene values in the database, so the ‘real’ (problem-oriented) compression is about 1:3.4.
Some tests and results on my Core 2 Duo, 6 GB RAM.
General statistics:
mysql> select count(*) from genes1;
+-------------+
| count(*) |
+-------------+
| 20000000000 |
+-------------+
1 row in set (0.05 sec)
mysql> select count(distinct patient_id) from genes1;
+----------------------------+
| count(distinct patient_id) |
+----------------------------+
| 10000 |
+----------------------------+
1 row in set (50.52 sec)
How efficient may be the access to a particular patient or some of his genes:
mysql> select gene_id, gene_val from genes1 where patient_id = 7319
> into outfile '~/genes_of_patient_7319';
Query OK, 2000000 rows affected (1.42 sec)
mysql> select gene_id, gene_val from genes1 where patient_id = 1219
> and gene_id between 478283 and 478290;
+---------+----------+
| gene_id | gene_val |
+---------+----------+
| 478283 | 0 |
| 478284 | AB1 |
| 478285 | AB1 |
| 478286 | AA1 |
| 478287 | AB1 |
| 478288 | 1 |
| 478289 | BB0 |
| 478290 | 0 |
+---------+----------+
8 rows in set (0.54 sec)
Fast enough. But - what is understandable - it is much harder to scan through one particular gene across all patients. Many data packs to be accessed:
mysql> select patient_id, gene_val from genes1 where gene_id = 1522311
> into outfile '~/gene_1522311_of_all_patients';
Query OK, 10000 rows affected (1 min 41.79 sec)
Similarly, statistics on a given patient are calculated much faster than statistics of a gene:
mysql> select gene_val, count(*) from genes1 where patient_id = 3722 group by gene_val;
+----------+----------+
| gene_val | count(*) |
+----------+----------+
| AA0 | 79534 |
| BB1 | 319298 |
| BB0 | 80408 |
| AB1 | 801290 |
| AA1 | 319691 |
| AB0 | 199729 |
| 1 | 160068 |
| 0 | 39982 |
+----------+----------+
8 rows in set (0.85 sec)
mysql> select gene_val, count(*) from genes1 where gene_id = 1900000 group by gene_val;
+----------+----------+
| gene_val | count(*) |
+----------+----------+
| AA0 | 396 |
| BB1 | 1648 |
| BB0 | 406 |
| AA1 | 1600 |
| AB1 | 3986 |
| AB0 | 990 |
| 1 | 775 |
| 0 | 199 |
+----------+----------+
8 rows in set (1 min 30.14 sec)
Something harder: calculate a similarity of two given patients (say, id. 1000 and 6000). The similarity is defined as a number of genes with identical values.
(Fasten your belts - we are joining two tables 20 bln rows each!)
mysql> select count(*) from genes1 t1, genes1 t2 where t1.patient_id = 1000 and t2.patient_id = 6000
> and t1.gene_id=t2.gene_id and t1.gene_val=t2.gene_val;
+----------+
| count(*) |
+----------+
| 462041 |
+----------+
1 row in set (2.19 sec)
To be honest, the last query is not working properly in all ICE releases. In an older one I’ve got “out of memory”. It was already fixed in 3.1 GA, as I remember.
Any ideas for more tests?
Regards,