hello, I tought ICE was ideal for genomic research where you have 10th of thousands of columns storing genetic data for a few hundreds or thousands of samples. So I installed the software to try it on my 64 bit linux box with 4 GO of RAM. I was able to load data sets with a limited number of columns but rapidly reached a limit. For example when I create a mysql table with 5000 columns using the INFOBRIGHT engine, I receive the message that the number of columns is too large. Is there a way to circumvent this limitation or is it impossible ?
Thank you in advance
Francois
ICE was not intended to have column number limitation lower than 100,000 (which is currently caused by data file naming convention). However, looks like MySQL does not allow to create tables exceeding its limit, which is 64 KB for a row. We will investigate it.
Nevertheless, even if we manage to relax a bit MySQL limitation and to allow creating more columns in Infobright engine, there will be still some limitations present. E.g. SELECT * would not work, because MySQL (with all its limitations) is responsible for displaying results. But exporting to file and analytical queries may work…
Thank you very much Jakub, this is what I suspected. It is a pitty to have a superb tool which should allow to treat problems that could not be handled previously using traditional DBMS and to be obliged to use it within the constraints of such DBMS. Don’t take this as a critic, I am sure evolution will be positive.
good luck
Francois
Can you tell us what kind of queries would you like to try on such database? I remember some articles where people did transposition of genomic data and operated on transposed table.
hello piotr,
transposing the table is not a good solution. It is much more convenient to represent the data with variables as columns and observations as rows. Let’s say I have a table with 50,000 columns with gene expression values and 2000 rows where the 50,000 values from each of the subject are stored. I want to rapidly retrieve each of column to analyse its content in relation to other information collected on each subject. I thought a column oriented database was ideal for such purpose.
Francois
Fully agree with your point. But I am trying to find some solution for your needs;) You can easily retrieve whole column from transposed table by using row_id (gene_id). Everything depends on what do you want to do next with this column, if to analyze it outside database or not.
Anyway, what is the type of data? Is it just value of gene expression from microarray (double, int?), or is it somehow discretized. One possibility is to play with data types to use 1-2 byte type for each column and match the constraint for 64KB row size.
Thanks,
Piotr
[ Edited: 28 October 2008 08:54 AM by Piotr Synak]
Actually, I did some research with gene expressions in the past. Therefore, I’m extremely interested in what you are doing.
One of possibilities is to slightly redefine the columns. If you add a new column GENE_ID, then you would be able to keep all the data in a more efficient structure, with far more records and with far less columns. The values for particular genes would be still easy to retrieve. The dependencies between genes et cetera would be still computable by appropriately designed SQL.
Thank you for your responses Piotr and Dominik.
Piotr: data are double for expression data and and can be smallint (or even smaller) for genotypes. My experience with transposed data sets using mysql is not good and what we do is use the file system as the database and store data as flat files (examples are .ped and .map datafiles for genome wide genotyping). This is not an ideal solution and it is why I was interested by the columns DBMS. I am looking for long term and general solution because this kind of data will accumulate rapidly in the months to come in our laboratory. A typical study is one were you have 5000 individuals with 1 million genetic markers, 30.000 gene expression measurements (probe signals), a few hundreds of other phenotypes and you want to find relationships in all this. Before analysis, data management is an important issue to resolve.
Dominik, we tried the solution you propose. We put the data of each column one after the other so in an expression matrix of 30000 probes in 5000 individuals we have 3 columns : columns 1, identifies the probe, column 2 the individual and column 3 the expression value. The 2 first columns may be used as an index. I have not really tested how fast, information can be retrieved from such table. May be we should use a solution like “Hypertable” after all.
I may be wrong but I thought that a column-based DBMS would provide an efficient approach to our problem. Meanwhile flat files offer a usable solution.
thank you for your interest
Francois
Thank you for your responses Piotr and Dominik.
Dominik, we tried the solution you propose. We put the data of each column one after the other so in an expression matrix of 30000 probes in 5000 individuals we have 3 columns : columns 1, identifies the probe, column 2 the individual and column 3 the expression value. The 2 first columns may be used as an index. I have not really tested how fast, information can be retrieved from such table. May be we should use a solution like “Hypertable” after all.
I may be wrong but I thought that a column-based DBMS would provide an efficient approach to our problem. Meanwhile flat files offer a usable solution.
Francois
The beauty of ICE is that it somehow combines the advantages of column-oriented and row-oriented storage. Indeed, with large number of columns and few-column-based queries, the column-oriented storages are better than the row-oriented ones. However, the ICE’s knowledge grid additionally optimizes the queries being selective with respect to the rows. So, with an additional condition on the gene index, ICE will go down precisely to the subset of expressions related to the given gene, with no need of looking through the whole column, like it happens in other column-oriented storages.
I wish we could keep in touch because it would be nice to compare performance of the analysis you are interested in when based on various data models and with various queries. Whenever you have any questions how to come up with SQL statements that would work efficiently with ICE to address your requirements, I would be happy to try to answer. And if I’m not able to answer, we can always ask some of my former PhD students involved into the gene expression analysis, so maybe they would be able to help us…
Dominik, thank you for this response, as soon as I have a little time I will try the 3 columns solution with mysql and compare the INFOBRIGHT and MyISAM engines.
regards
Francois