Infobright Blog

Joint Rough Set Conference and Data Mining Competition in 2012

by Dominik Slezak, Thu, Dec 22, 2011

Hello,

Some of my academic colleagues still remember the first Joint Rough Set (JRS) Conference, held in 2007 in Toronto and opened with speeches by the presidents/CEOs of York U, MaRS and Infobright. International rough set conferences have a far longer tradition, of course, but 2007 was the first time that originally separate rough set events were organized at the same time and place. Five years later, the rough set community is returning to this idea: JRS 2012 will be held in Chengdu, China. It will be the only major international rough set event in 2012. Given the popularity of rough sets in China and neighboring countries, it has a chance to become the biggest rough set meeting since the first international rough set workshop in 1992.

August is a very good time to visit Chengdu, a historical city surrounded by natural wonders hidden high in the mountains. I still remember visiting Jiuzhaigou, located near Chengdu, in 2006. (I attach a picture of my wife and me taken near the valley.) On the other hand, in 2008 Chengdu suffered a huge earthquake just before another international rough set conference (RSKT 2008). Those participants who still managed to reach the city could see that nature can be both beautiful and brutal. Therefore, for many of my friends and colleagues, attending JRS 2012 will mean something more than just another academic meeting.

JRS 2012 may also turn out to be very interesting for researchers and practitioners who have never thought about attending a rough set event before. As an example, let me briefly describe the Data Mining Competition organized in conjunction with the main conference. The classification problem posed as the goal of the competition originates in the very recent experience of my U of Warsaw colleagues with the analysis of biomedical research papers gathered in the PubMed Central repository. (Here is a link to one of our papers about it.) The data set provided for the competition contains information about 20,000 journal articles (split into train and test samples) described by 25,640 attributes: numeric columns expressing the strengths of association between articles and MeSH ontology terms. The task is to train a classifier that can automatically tag research papers with the most appropriate MeSH subheadings, in other words, classify papers into predefined topic categories. The classifier with the highest accuracy (measured by comparison with tags assigned manually by medical experts) wins the competition.
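To make the idea of "comparison with expert tags" a bit more concrete, here is a minimal Python sketch. The exact accuracy measure is not stated here, so the average per-article F1 score below is only an assumption, and the article ids and tag names are made up for illustration.

```python
from typing import Dict, Set


def f1_for_article(predicted: Set[str], expert: Set[str]) -> float:
    """F1 overlap between the predicted and the expert tag set of one article."""
    if not predicted and not expert:
        return 1.0  # nothing to tag and nothing tagged: treat as full agreement
    true_positives = len(predicted & expert)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expert)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: Dict[str, Set[str]],
            expert_tags: Dict[str, Set[str]]) -> float:
    """Average per-article F1 over all articles that have expert tags."""
    scores = [
        f1_for_article(predictions.get(article, set()), tags)
        for article, tags in expert_tags.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical article ids and tags, for illustration only.
expert = {"a1": {"genetics", "metabolism"}, "a2": {"therapy"}}
predicted = {"a1": {"genetics"}, "a2": {"therapy", "diagnosis"}}
score = mean_f1(predicted, expert)
```

A set-based score of this kind rewards a classifier both for finding the expert tags and for not adding spurious ones, which matters when each paper can carry several subheadings.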

One might say that such data mining competitions are not very interesting for database people. However, there are at least two important database-related aspects here. First, the data set described above was created using database tools. As you may remember (I have blogged about it a few times), my U of Warsaw colleagues use Infobright, MongoDB and several other technologies to store very detailed information about documents downloaded from various sources. In particular, this enabled the competition organizers to compute the weights of associations between articles and ontology terms more thoroughly. Second, it is interesting to discuss whether standard algorithms for classifier learning (for example, algorithms that extract optimal decision trees or SVMs from data) can still be efficient for truly large data volumes, and whether SQL-based analytic scripts might be used to speed them up. (I have blogged about this as well.) For this particular data set and classification problem, it may actually not be a bad idea to use the provided train sample to construct a random forest of trees based on different subsets of attributes. How can one heuristically search through the space of all forests using SQL in order to find the most promising classifier ensemble? A good question for potential competition participants.
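One possible starting point, sketched in Python with the standard sqlite3 module standing in for an analytic database, is shown below. Everything in it is an assumption made for illustration: the toy data, the table layout (`train`, `labels`), the fixed threshold, the use of one-level "stumps" instead of full trees, and plain random search as the heuristic. The SQL query scores all attributes of a random subset in one GROUP BY pass; the forest search then keeps the candidate ensemble with the best majority-vote accuracy.

```python
import random
import sqlite3

# Toy data (hypothetical): 8 articles, 6 attributes, one binary topic label.
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
VALUES = [  # VALUES[attr_id][article_id] = association strength
    [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.0, 0.3],
    [0.9, 0.8, 0.1, 0.7, 0.2, 0.6, 0.0, 0.1],
    [0.6, 0.1, 0.7, 0.2, 0.8, 0.1, 0.9, 0.3],
    [0.1, 0.9, 0.2, 0.8, 0.7, 0.1, 0.2, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.2, 0.1, 0.3, 0.9, 0.8, 0.7, 0.2],
]
THRESHOLD = 0.5  # assumed cut-off: a stump predicts label 1 if value > THRESHOLD


def load(conn):
    """Store the toy sample in a long (article, attribute, value) layout."""
    conn.execute("CREATE TABLE train (article_id INTEGER, attr_id INTEGER, value REAL)")
    conn.execute("CREATE TABLE labels (article_id INTEGER PRIMARY KEY, label INTEGER)")
    conn.executemany("INSERT INTO labels VALUES (?, ?)", list(enumerate(LABELS)))
    conn.executemany(
        "INSERT INTO train VALUES (?, ?, ?)",
        [(art, attr, v) for attr, col in enumerate(VALUES) for art, v in enumerate(col)],
    )


def score_attributes(conn, attrs):
    """Accuracy of the stump on each given attribute, computed by one SQL query."""
    marks = ",".join("?" * len(attrs))
    sql = f"""
        SELECT t.attr_id,
               AVG(CASE WHEN (t.value > ?) = (l.label = 1) THEN 1.0 ELSE 0.0 END)
        FROM train t JOIN labels l ON l.article_id = t.article_id
        WHERE t.attr_id IN ({marks})
        GROUP BY t.attr_id
    """
    return dict(conn.execute(sql, [THRESHOLD, *attrs]).fetchall())


def forest_accuracy(forest):
    """Majority-vote accuracy of a list of stump attributes on the toy sample."""
    correct = 0
    for art, label in enumerate(LABELS):
        votes = sum(VALUES[attr][art] > THRESHOLD for attr in forest)
        predicted = 1 if 2 * votes > len(forest) else 0
        correct += predicted == label
    return correct / len(LABELS)


def search_forests(conn, n_candidates=20, n_trees=3, subset_size=3, seed=0):
    """Random search: each stump is the best attribute of a random attribute subset."""
    rng = random.Random(seed)
    best_forest, best_acc = None, -1.0
    for _ in range(n_candidates):
        forest = []
        for _ in range(n_trees):
            subset = rng.sample(range(len(VALUES)), subset_size)
            scores = score_attributes(conn, subset)
            forest.append(max(scores, key=scores.get))
        acc = forest_accuracy(forest)
        if acc > best_acc:
            best_forest, best_acc = forest, acc
    return best_forest, best_acc


conn = sqlite3.connect(":memory:")
load(conn)
all_scores = score_attributes(conn, list(range(len(VALUES))))
best_forest, best_acc = search_forests(conn)
```

On real data one would score forests on a held-out part of the train sample rather than on the rows used to pick the stumps, and the per-attribute scoring query is the part most likely to benefit from a columnar analytic engine.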

Merry Christmas!

Dominik
