Joinutility seperatorLogin utility separator Infobright.com
   
 
Biomedicine and healthcare
Posted: 25 November 2008 07:57 AM   Ignore ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello,

I noticed some interest in biomedical and healthcare-related data across the forums.

See for example the earlier posts about the gene expression data:

http://www.infobright.org/Forums/viewthread/265/

How about merging further relevant discussions into a single thread?

I believe we may learn a lot from each other in the following two areas:

—What are major aspects of data warehousing applications in biomedicine and healthcare?

—Is Infobright’s technology a good choice for such types of applications?

In particular, I looked around and found the following links that may be useful:

1. http://hdwa.org/hdwa/home/—Healthcare Data Warehousing Association (HDWA).

2. http://www.bio-medicine.org/—I’ve put “data warehouse” into the search window.

3. http://www.bioinformatics.org/—I’ve done the same as above.

I’m sure that people working actively in these areas may know even more interesting, specific links…

Best greetings,

Dominik

Signature 
Profile
 
Posted: 01 April 2009 11:45 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  602
Joined  2008-08-18

Hello,

There was an interesting problem posted to one of newsgroups. Someone wants to store and analyze genetic data of the following characteristics:

- number of patients: several thousands, up to 30000,
- number of genes: 2000000,
- values of genes: ‘AA’ or ‘AB’ or ‘BB’ or null, followed by 0 or 1. About 40% of values will be ‘AB1’, so the distribution is not uniform,
- additional requirements for possibility of multiuser access, read only,
- updates by adding patients with a complete gene description, slight chance of adding genes.

In the next post I will describe my experiments.

Regards,

Signature 
Profile
 
Posted: 01 April 2009 12:05 PM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  602
Joined  2008-08-18

I suggested to use ICE with the following schema:

CREATE TABLE genes1 (
  
patient_id   int,
  
gene_id      int,
  
gene_val     char(3comment 'lookup'
engine=brighthouse

I’ve prepared a script to generate and load artificial data. In a medium-scale experiment I’ve generated 10000 patients, which gives 306 GB of text input files. Generating these random data took 12.5 h, then loading 5 h.

Compression 1:50 (well, not a big achevement given that e.g. first 2 mln rows have the same patient_id), i.e. the database folder is 5.9 GB. Note that we have 20 bln random gene values in the database, so the ‘real’ (problem-oriented) compression is about 1:3.4.

Some tests and results on my Core 2 Duo, 6 GB RAM.
General statistics:

mysqlselect count(*) from genes1;
+-------------+
count(*)    |
+-------------+
20000000000 
+-------------+
1 row in set (0.05 sec)

mysqlselect count(distinct patient_idfrom genes1;
+----------------------------+
count(distinct patient_id) |
+----------------------------+
|                      
10000 
+----------------------------+
1 row in set (50.52 sec

How efficient may be the access to a particular patient or some of his genes:

mysqlselect gene_idgene_val from genes1 where patient_id 7319 
     
into outfile '~/genes_of_patient_7319';
Query OK2000000 rows affected (1.42 sec)

mysqlselect gene_idgene_val from genes1 where patient_id 1219 
     
> and gene_id between 478283 and 478290;
+---------+----------+
gene_id gene_val |
+---------+----------+
|  
478283 |   0      
|  
478284 AB1      
|  
478285 AB1      
|  
478286 AA1      
|  
478287 AB1      
|  
478288 |   1      
|  
478289 BB0      
|  
478290 |   0      
+---------+----------+
8 rows in set (0.54 sec

Fast enough. But - what is understandable - it is much harder to scan through one particular gene across all patients. Many data packs to be accessed:

mysqlselect patient_idgene_val from genes1 where gene_id 1522311 
     
into outfile '~/gene_1522311_of_all_patients';
Query OK10000 rows affected (1 min 41.79 sec

Similarly, statistics on a given patient are calculated much faster than statistics of a gene:

mysqlselect gene_valcount(*) from genes1 where patient_id 3722 group by gene_val;
+----------+----------+
gene_val count(*) |
+----------+----------+
AA0      |    79534 
BB1      |   319298 
BB0      |    80408 
AB1      |   801290 
AA1      |   319691 
AB0      |   199729 
|   
1      |   160068 
|   
0      |    39982 
+----------+----------+
8 rows in set (0.85 sec)

mysqlselect gene_valcount(*) from genes1 where gene_id 1900000 group by gene_val;
+----------+----------+
gene_val count(*) |
+----------+----------+
AA0      |      396 
BB1      |     1648 
BB0      |      406 
AA1      |     1600 
AB1      |     3986 
AB0      |      990 
|   
1      |      775 
|   
0      |      199 
+----------+----------+
8 rows in set (1 min 30.14 sec

Something harder: calculate a similarity of two given patients (say, id. 1000 and 6000). The similarity is defined as a number of genes with identical values.

(Fasten your belts - we are joining two tables 20 bln rows each!)

mysqlselect count(*) from genes1 t1genes1 t2 where t1.patient_id 1000 and t2.patient_id 6000 
     
> and t1.gene_id=t2.gene_id and t1.gene_val=t2.gene_val;

+----------+
count(*) |
+----------+
|   
462041 
+----------+
1 row in set (2.19 sec

To be honest, the last query is not working properly in all ICE releases. In an older one I’ve got “out of memory”. It was already fixed in 3.1 GA, as I remember.

Any ideas for more tests?

Regards,

Signature 
Profile
 
Posted: 01 April 2009 09:59 PM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Kuba,

This is amazing. I didn’t fasten my seatbelt and I nearly fell off the chair looking at the last query. I mean that, on the one hand, it’s something we should expect for these types of queries given our architecture specifics. On the other hand, however, it’s still so impressive!

In your earlier email, you mentioned url of that newsgroup:

http://groups.google.pl/group/pl.comp.bazy-danych/browse_thread/thread/1ec6cd59b2a93fcb#

It’s worth visiting for those who know Polish.

Many thanks and best greetings,

Dominik

Signature 
Profile
 
Posted: 29 May 2009 09:16 AM   Ignore ]   [ # 4 ]  
Member
Avatar
RankRankRank
Total Posts:  154
Joined  2009-04-01

I presented our technology to a group of local data warehouse professionals (http://www.irmac.ca/) yesterday evening here in Toronto.

One individual was very interested in using ICE on her project.. The project is mapping gene sequences and is really struggling with data storage. They are hoping to get huge benefits from our compression, and also use ICE to analyze the multi terabytes of data they collect.

I will let you know if I here anything more about their progress.

[ Edited: 04 October 2009 08:49 PM by Bob Newell]
Profile
 
   
 
 
     Online Marketing ››