ICDE 2009 Day By Day
Posted: 29 March 2009 11:53 AM
Sr. Member
Total Posts:  453
Joined  2008-08-18

Here is the place for “real-time” updates on ICDE 2009 (http://i.cs.hku.hk/icde2009/index.htm). I’ll do my best to report and comment on the most interesting presentations and discussions.

Best greetings,

Dominik

Posted: 30 March 2009 01:24 PM   [ # 1 ]

Monday:

Let me start with the keynote talk by Stefano Ceri, who introduced the concept of search computing, related to multi-domain queries on the Web. To put it in simple words, imagine that you want to join two tables that are… the outcomes of two Web services. For example, let’s buy an inexpensive (according to e.g. Amazon) but popular (iTunes) CD. Surely, we might do some “manual” search and combine the results ourselves. However, that is not how database joins work these days. So how do we establish an analogous, sound and efficient framework for data that is dynamically retrieved from “somewhere”? The project headed by Professor Ceri is heading in this direction.
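
To make the join idea a bit more concrete, here is a minimal Python sketch. It is only my own illustration, not the framework from the keynote: the album titles, prices and ranks are made up, the two lists stand in for whatever the two services would return, and the function name join_on_title is hypothetical.

```python
# Made-up results standing in for two independent Web services.
price_service = [
    {"title": "Album A", "price": 7.99},
    {"title": "Album B", "price": 14.99},
    {"title": "Album C", "price": 5.49},
]
popularity_service = [
    {"title": "Album A", "rank": 3},
    {"title": "Album C", "rank": 120},
    {"title": "Album D", "rank": 1},
]


def join_on_title(prices, ranks, max_price, max_rank):
    """Hash join on title, keeping items that are both cheap and popular."""
    # Build a lookup table on the join key from one side.
    rank_by_title = {row["title"]: row["rank"] for row in ranks}
    joined = []
    # Probe the lookup table with each row from the other side.
    for row in prices:
        rank = rank_by_title.get(row["title"])
        if rank is not None and row["price"] <= max_price and rank <= max_rank:
            joined.append(
                {"title": row["title"], "price": row["price"], "rank": rank}
            )
    return joined


result = join_on_title(
    price_service, popularity_service, max_price=10.0, max_rank=10
)
print(result)
```

The hard part discussed in the talk is everything this sketch ignores: the two inputs arrive incrementally, ranked and paged, from services we do not control.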

Then I listened to two XML-related presentations at the beginning of Industrial Session 1. It’s a shame that I didn’t know SQL/XML before. This SQL extension shows how to think about XML as yet another relational data type. Certainly, by putting XML into a relational framework we give up its navigational features. On the other hand, we are then able to analyze it consistently with other data. How about functions that annotate rows with, e.g., strings extracted from XML structures? Well, it sounds simple but - as in the case of the keynote talk - the point is how to run extended SQL efficiently. In this respect, I refer especially to the first talk in the session.
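
Just to illustrate what I mean by annotating rows with strings extracted from XML, here is a small Python sketch using the standard xml.etree.ElementTree module. It is not SQL/XML itself; the rows, the XML documents and the function name annotate are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up rows: a relational "id" column next to an XML-typed "doc" column.
rows = [
    {"id": 1, "doc": "<order><customer>Alice</customer><total>30</total></order>"},
    {"id": 2, "doc": "<order><customer>Bob</customer><total>12</total></order>"},
]


def annotate(rows, path):
    """Return one output row per input row with the text found at `path`."""
    out = []
    for row in rows:
        root = ET.fromstring(row["doc"])
        # findtext returns the text of the first matching child, or None.
        out.append({"id": row["id"], "value": root.findtext(path)})
    return out


annotated = annotate(rows, "customer")
print(annotated)
```

Once the extracted value sits in an ordinary column, it can be filtered, grouped and joined like any other attribute, which is the consistency argument made above.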

Finally, I attended (the last part of) the tutorial on graph mining. All the materials are actually available here. A lot of inspiration! For example, cross-associations in Part IV may be quite interesting in the context of our Pack-To-Packs. I was also intrigued by the usage of MapReduce/Hadoop to scale the graph mining computations. Well, MapReduce is certainly the topic for a separate post. Let me rather conclude this day in a different way. I realize that all those graphs, XML documents, and Web queries may look quite odd in this forum. However, isn’t it just a matter of time before we start dealing with less structured, more dynamic data in data warehousing applications?

Best greetings,

Dominik

[ Edited: 30 March 2009 06:30 PM by Dominik Slezak]
Posted: 31 March 2009 12:26 PM   [ # 2 ]

Tuesday:

I didn’t sleep well. I kept thinking about examples of using SQL in data mining algorithms. It’s becoming one of my hobbies. If it’s your hobby too, please visit this thread. I know it’s nothing fancy, not even close to real-life data problems. However, if appropriately adjusted to practical applications, maybe it could lead to something useful? Anyway, half awake and half dreaming about SQL, I attended the sessions.

The keynote today was a perfect example of the difference between dreams and reality. David Carlson spoke about one of those scientific megaprojects where nothing is easy, where everything needs to be resolved from scratch, starting with funding and finishing with science. With regard to the project’s goals, I especially liked the aspects of building predictive models. Yesterday we learnt about data that is compound, unstructured and not in place. Today we heard about data that doesn’t even exist yet. (I know that people usually talk about prediction models in a different style, but isn’t it tempting to describe it like this?) By the way, just to digress a bit, I remember a very nice paper about processing forecasting queries. A simple and bright idea… Nevertheless, for Dr Carlson and other International Polar Year participants simple and bright ideas are not enough. They need to adapt them to solve highly complex tasks, over highly complex data.

The next item in today’s schedule was the awards ceremony. The best ICDE 2009 paper is: “Histograms and Wavelets on Probabilistic Data” by Graham Cormode and Minos Garofalakis. It was presented in the Uncertainties session. Well, I decided to attend the Transactions session, which was held in parallel. However, I’ll surely get back to this paper. It’s perhaps too early to talk about it, but there may be some analogies between processing uncertain data and the way we do rough computing on Infobright’s Knowledge Grid. In this respect, please have a look at, e.g., the February posts on our academic blog.

The best student paper is: “Double Index Nested-loop Reactive Join for Result Rate Optimization” by Mihaela Bornea, Vasilis Vassalos, Yannis Kotidis and Antonios Deligiannakis. I attended the Query Optimization session where the paper was presented. (Although the Data Mining 1 session was held in parallel. What a pity.) The speaker referred to several good papers on non-blocking join algorithms. It’s wonderful that further improvements are still possible. The session also included other interesting presentations about adaptive query processing and optimizations based on sampling and statistics. Again, I should probably refer to our Knowledge Grid. But that would be too much about Infobright in a single post…

Instead, let me go back to the awards ceremony. The most influential paper award goes to Kin-Pong Chan and Wai-Chee Fu, for the paper titled “Efficient time series matching by wavelets”. The original idea of conducting similarity search by transforming highly complex time series data into a space spanned by a relatively small number of wavelet-based dimensions has been extended in many ways. It was a great pleasure to listen to this presentation. Let me sincerely follow the speaker in her last sentence: I wish every one of you a very good time series!
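
For those who have not seen the wavelet trick before, here is a small Python sketch of the general idea as I understand it, not the exact procedure from the awarded paper. It uses an orthonormal Haar transform, which preserves Euclidean distance, so a distance computed on only the first few coefficients can never exceed the true distance. The two series and the function names haar and distance are made up for illustration.

```python
import math


def haar(series):
    """Orthonormal Haar transform; len(series) must be a power of two."""
    coeffs = []
    current = list(series)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        avgs = [(a + b) / math.sqrt(2) for a, b in pairs]
        diffs = [(a - b) / math.sqrt(2) for a, b in pairs]
        # Coarser detail coefficients end up in front of finer ones.
        coeffs = diffs + coeffs
        current = avgs
    return current + coeffs


def distance(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 4, 4, 6, 6, 8, 8]

full = distance(x, y)                         # distance on raw series
reduced = distance(haar(x)[:2], haar(y)[:2])  # distance on 2 coefficients

print(full, reduced)
```

Because the reduced distance is a lower bound, a search in the small space may return false candidates but does not miss true matches; the candidates are then verified on the full series.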

Best greetings,

Dominik

[ Edited: 31 March 2009 12:41 PM by Dominik Slezak]
Posted: 01 April 2009 01:08 PM   [ # 3 ]

Wednesday:

Keynote #3: Data Management in the Cloud by Raghu Ramakrishnan. The speaker could not physically attend ICDE 2009. However, he connected and spoke from the USA. It was a kind of lecture from the cloud. And it wasn’t bad at all! Actually, together with some of my academic colleagues, we’re organizing a couple of small rough sets / soft computing events in India this year. I’m not able to attend all of them. Thus, I’ll do my best to speak from the cloud as well.

Data in the Cloud is a very hot topic. We’re doing something with it at Infobright too. (Let us refer, e.g., to our community blog.) However, compared to Yahoo! it’s just kindergarten. In his talk, Dr. Ramakrishnan introduced - step by step - the main features, applications, and components of cloud solutions. It was extremely useful to hear about data serving vs. data analysis, clouds being functional (check out Search Monkey and BOSS!) and horizontal, details related to data consistency et cetera – all in one lecture!

After the keynote, I attended the Query Processing 1 session. I particularly liked the presentation of the paper titled “Space-Constrained Gram-Based Indexing for Efficient Approximate String Search”. Grams are sub-sequences of fixed length. For example, 2-grams represent pairs of consecutive characters. The idea is to use grams both to express similarities between strings and to build inverted indexes over the strings in the database. Given that such indexes would be too large, the authors compress them. Precisely, the authors decrease the index size by (roughly speaking) removing or combining some of the grams. On the one hand, one can regard it as a lossy index compression. On the other hand, the authors show that it does not lead to a decrease in querying exactness and even increases the speed.
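
Here is a minimal Python sketch of the uncompressed starting point, i.e. a plain 2-gram inverted index used to find candidate strings that share enough grams with a query. It does not implement the compression proposed in the paper. The sample strings, the function names and the min_shared threshold are assumptions made up for this illustration.

```python
from collections import defaultdict


def grams(s, q=2):
    """Return the set of q-grams (length-q substrings) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(strings, q=2):
    """Inverted index: gram -> set of ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].add(sid)
    return index


def candidates(query, index, strings, q=2, min_shared=4):
    """Strings sharing at least `min_shared` distinct grams with the query."""
    counts = defaultdict(int)
    for g in grams(query, q):
        for sid in index.get(g, ()):
            counts[sid] += 1
    return sorted(strings[sid] for sid, c in counts.items() if c >= min_shared)


strings = ["infobright", "infobrite", "warehouse"]
index = build_index(strings)
found = candidates("infobrigt", index, strings)
print(found)
```

The candidates still need to be verified with a real similarity function such as edit distance; the index only narrows down the set of strings worth checking.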

I like this paper also because the way of reasoning is somewhat analogous to what we’re doing in ICE. Recently, we even considered adding some simple gram-based structures to our Knowledge Grid. However, there are no satisfactory results yet. (For the existing structures we refer, as usual, to VLDB 2008.) Certainly, in our case it’s a different scenario. We “index” at the level of larger packs of values. Also, we’re not (yet!) into similarity searches. Nevertheless, I’ll introduce this paper in detail to my colleagues at Infobright.

Finally, I attended the last part of the tutorial on Preference Queries from OLAP and Data Mining Perspective. Again, a lot of valuable material! (The authors intend to make the materials available online soon.) I was especially interested in problems related to searching for minimal Satisfying Preference Sets. The idea is to ask the users to identify some most preferred (superior) and least preferred (inferior) examples from the available data and to heuristically find the smallest possible subset of the available preference criteria that would justify the users’ choice. It partially resembles the preference reduction methodology developed within the framework of rough sets. I referred to this approach in the rough set thread (#9). However, the case of such a partially defined target preference attribute was not considered there.

Best greetings,

Dominik

[ Edited: 01 April 2009 01:14 PM by Dominik Slezak]
Posted: 01 April 2009 01:26 PM   [ # 4 ]
Jr. Member
Total Posts:  54
Joined  2008-08-18

Dominik,

This is great stuff. I was a bit surprised to see this in the forum because I would have thought you would put all of this on your blog. You should think about posting this to your blog. People love it when you go to a conference and provide a daily blog post ‘from the conference’ talking about things you think are interesting. And it is good on your blog to include URL links and pointers to interesting speakers, especially as you talk about cloud also. Mir

Miriam G. Tuerk
http://www.infobright.com

Posted: 01 April 2009 02:35 PM   [ # 5 ]
Administrator
Total Posts:  519
Joined  2008-07-08

It is on his blog.  I’ve been putting it there to save him the time of cross-posting to a bunch of places.

Posted: 01 April 2009 06:51 PM   [ # 6 ]

Miriam, thanks for the kind words. Mark, thanks for the help. The last day’s update is on its way. I’ll put it both here and on the blog. The forum thread will be a good place for discussion. On the other hand, the updates look nice on the blog as well.

Best greetings,

Dominik

Posted: 02 April 2009 09:42 AM   [ # 7 ]

Thursday:

I attended the sessions on Data Integration and Warehousing and Data Mining 3. All presentations were very interesting. I’ll focus on two of them here, but I’ll get back to the others sooner or later. Actually, since this is the last day, let me discuss the first presentation of each session.

The first paper in the Data Integration and Warehousing session – “Aggregate Query Answering under Uncertain Schema Mappings” by Avigdor Gal, Maria Vanina Martinez, Gerardo Simari and V. Subrahmanian – shows that inexact data is not the only source of uncertainty in data processing. The paper deals with uncertainties related to inconsistent schema specifications that may occur during data integration. Some examples and techniques mentioned during the talk referred to interval data analysis, which is very close to my heart. On top of that, the paper concentrates on aggregations that are quite typical for data warehousing. Given uncertainty at the level of data interpretation, the authors compute estimates of aggregation results in an exact way. It means, e.g., that the min/max bounds for the query results (the authors also consider probability distributions and expected values) are fully certain. On the other hand, the authors intend to develop more approximate algorithms in the future. This is analogous to the ICE-related discussions about approximate querying. Currently, the ICE engine uses rough information to compute exact bounds, which are dynamically applied to re-optimize the process leading towards the final answers. However, more approximate calculations may lead to faster querying, with results that are “precise enough”.
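
A toy Python sketch may help to show what “exact bounds under an uncertain mapping” can mean. It is a strong simplification of my own, not the authors’ algorithm: I assume the only uncertainty is which of two source columns corresponds to the target attribute, and that one mapping applies to the whole table. The column names, the data and the function name sum_bounds are made up.

```python
# Made-up source table; either column might be the intended "price".
rows = [
    {"list_price": 10.0, "sale_price": 8.0},
    {"list_price": 20.0, "sale_price": 15.0},
    {"list_price": 5.0, "sale_price": 5.0},
]
candidate_mappings = ["list_price", "sale_price"]


def sum_bounds(rows, candidate_columns):
    """Exact lower and upper bound of SUM over all candidate mappings."""
    # Evaluate the aggregate once per possible mapping of the whole table.
    totals = [sum(r[c] for r in rows) for c in candidate_columns]
    return min(totals), max(totals)


low, high = sum_bounds(rows, candidate_mappings)
print(low, high)
```

Whatever the true mapping is, the true SUM is guaranteed to lie between the two values, which is the sense in which the bounds are fully certain even though the answer is not.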

The Data Mining 3 session started with the paper titled “Another Outlier Bites the Dust: Computing Meaningful Aggregates in Sensor Networks” (what a title!) by Antonios Deligiannakis, Yannis Kotidis, Vasilis Vassalos, Vassilis Stoumpos and Alex Delis. The authors deal with yet another form of uncertainty – noisy measurements that may lead to biased, unreliable query results. This paper truly opened my eyes! In one of my previous posts, I described the basic idea of replacing the exact min/max bounds for particular data packs stored in ICE with their “slightly cheated” versions that would represent “99%” of the data. However, I was worried that such “cheating” might lead to something too distant from precise answers. Now it turns out that in some cases such imprecise answers may be more valuable than the precise ones. Of course, the procedure of identifying outliers needs to be carefully adjusted to the given application (so the above-mentioned “99%” strategy needs a lot of improvements!). Honestly, I wish to thank the speaker for the inspiration. How could I forget about such types of applications?!
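
To show what I mean by the “99%” bounds, here is a minimal Python sketch. It is only an illustration of the idea, not how ICE stores or computes pack statistics, and certainly not the outlier detection from the paper. The pack contents and the function names exact_bounds and trimmed_bounds are made up.

```python
def exact_bounds(values):
    """Exact min/max of a pack of values."""
    return min(values), max(values)


def trimmed_bounds(values, coverage_percent=99):
    """Min/max after dropping an equal share of values from both ends.

    With coverage_percent=99, 0.5% of the values are dropped at each end,
    rounded down to a whole number of values.
    """
    ordered = sorted(values)
    drop = len(ordered) * (100 - coverage_percent) // 200
    kept = ordered[drop:len(ordered) - drop]
    return kept[0], kept[-1]


# Made-up pack: 198 regular readings plus two noisy outliers.
pack = list(range(1, 199)) + [-1000, 100000]

print(exact_bounds(pack))
print(trimmed_bounds(pack))
```

With the two noisy readings cut off, the bounds describe the bulk of the pack much better, at the price of no longer being guaranteed for every single value.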

Best greetings,

Dominik

PS: This is the last post for now but I hope some discussion will follow…
