Hello and welcome to 2010!
This time I’m still going to write partially about 2009 but with some thoughts about 2010 as well. :)
First of all, let me finish writing about my academic odyssey in December. After Malaysia and Korea, I attended conferences in Delhi and Tumkur. In both places, I enjoyed talking and answering questions about Infobright’s technology, with some interesting remarks from the Artificial Intelligence and Soft Computing experts. I also enjoyed other presentations and discussions, for example the meeting with people from TCS, as well as the wonderful cultural program, especially the music!
The most inspiring keynote talk was on Granular Computing. It was delivered by Dr. Andrzej Skowron, who used to be my PhD supervisor and remains my all-time scientific mentor and the head of the research team that I’m working with at the University of Warsaw. (Yes, now I’m also back in academia!) Still, I learnt something new. You often need to travel across continents to finally get something you hear every day.
You’ll find references to Granular Computing on Infobright’s page in Wikipedia. It’s about data granules (like our data packs) and information granules (like our knowledge nodes). It’s about computing with information granules instead of dealing with single data points. Granules are usually formed based on various kinds of domain knowledge, for example ontologies in search engines, variable resolutions in image analysis, et cetera. With Infobright, the formation of granules is perhaps not as sophisticated :) but the overall methodology remains the same!
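To make "computing with information granules" a bit more tangible, here is a minimal Python sketch. It is only an illustration under my own assumptions, not Infobright's actual implementation: the pack size, the `KnowledgeNode` fields (min, max, count) and the function names are all hypothetical. A column is cut into data packs, each pack gets a rough summary, and a simple counting query opens a pack only when its summary is inconclusive.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # hypothetical and tiny, chosen only to keep the example readable


@dataclass
class KnowledgeNode:
    """Information granule: a rough summary of one data pack (assumed fields)."""
    lo: int
    hi: int
    count: int


def build_packs(values: List[int], pack_size: int = PACK_SIZE) -> List[List[int]]:
    """Data granules: consecutive chunks of the column."""
    return [values[i:i + pack_size] for i in range(0, len(values), pack_size)]


def summarize(pack: List[int]) -> KnowledgeNode:
    return KnowledgeNode(lo=min(pack), hi=max(pack), count=len(pack))


def count_greater_than(packs: List[List[int]],
                       nodes: List[KnowledgeNode],
                       threshold: int) -> Tuple[int, int]:
    """Count values > threshold; return (answer, number of packs opened)."""
    total = 0
    opened = 0
    for pack, node in zip(packs, nodes):
        if node.hi <= threshold:
            continue                 # irrelevant pack: no value can qualify
        if node.lo > threshold:
            total += node.count      # relevant pack: every value qualifies
        else:
            opened += 1              # suspect pack: look at single data points
            total += sum(1 for v in pack if v > threshold)
    return total, opened


values = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 8]
packs = build_packs(values)
nodes = [summarize(p) for p in packs]
result, opened = count_greater_than(packs, nodes, threshold=5)
print(result, opened)
```

In this toy run, two of the three packs are settled from their summaries alone, and only the mixed one has to be opened.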
I’m sure I’ll go back to Granular Computing at least several times this year. But now let me finish with 2009. Many academic projects were finished successfully. The above-mentioned conferences concluded the Rough Set Year in India. The International Rough Set Society starts 2010 with newly elected executive members. But not everything was so successful. For example, we put a lot of effort into preparing the rough data contest in Delhi. However, we didn’t clarify the contest rules well enough. Also, some people claimed that the provided rough data (knowledge nodes, aka information granules) were of insufficient quality to mine any patterns or dependencies. We will need to improve ourselves very quickly…
But let me conclude with something optimistic. Thanks to Dr. Skowron’s generosity, here are handouts from his talk (careful! - big pdf).
And… I attach one of the conference pictures. – Guess where I am!
Best greetings,
Dominik
Hello,
This time I’m going to write about Korea and more precisely about the Future Generation Information Technology (FGIT 2009) conference that was held on Jeju Island last week. I’ve been to Jeju two times before but it still strikes me with its special charm. Well, I realize that this blog should be more about technology than travel but let me at least refer you to some nice pictures of the island and the venue.
The idea of this conference relates to the concept of Hybrid Information Technology, considered a few years ago together with some of my colleagues in Korea. Nowadays, they are holding their own Hybrid IT related academic events (see also ICHIT and ICCIT) with a high number of paper submissions (in the case of FGIT 2009 it exceeded 1100 articles). It shows a huge practical need for interdisciplinary research in IT.
FGIT is a multi-conference that consists of 10 events with their own committees and materials, but unified with respect to organization, keynotes and the special volume of materials containing the most representative papers accepted to particular events. The special volume is published within the LNCS series (probably quite familiar to academic readers). The remaining papers can be found in 10 CCIS volumes. The CCIS series is a new Springer initiative, a good complement to the more established Lecture Notes in Computer Science, Artificial Intelligence, and Bioinformatics. In my opinion, CCIS may soon become more attractive to industry and practitioners than its older LNCS/LNAI/LNBI brothers.
However, let’s leave behind all those complicated publishing strategy details and focus on the contents. Let’s go back to keynotes and cite at least two out of nine invited speakers:
The Era of Social Computing (Irwin King): The Web has changed the landscape of how humans interact socially. With the advent of Web 2.0, Social Computing has emerged as a new and innovative paradigm that changes the way we communicate, interact, and learn. Social Computing involves the investigation of collective intelligence by using computational techniques such as machine learning, data mining, natural language processing, etc. on social behavioral data collected from blogs, wikis, emails, instant messages, clickthrough data, query logs, social bookmarks, tags, etc. In this talk, I will first introduce Social Computing by outlining some of the unique characteristics and aspects that are found on the various social platforms. Applications in each of the platforms will be presented to further demonstrate the use of these new technologies to enhance and enrich our lives. Lastly, I will conclude with some current challenges and potential future promises of Social Computing.
The Ubiquitous DBMS (Kyu-Young Whang): Recent widespread use of mobile technologies and advancement in computing power prompted strong needs of database systems that can be used in small devices such as sensors, cellular phones, PDA, ultra PCs, and navigators. We call database systems that are customizable from small-scale applications for small devices to large-scale applications such as large-scale search engines ubiquitous database management systems (UDBMSs). In this talk, we first review requirements of UDBMSs. The requirements we identified include selective convergence (or “devicetization”), flash-optimized storage system, data synchronization, supportability of unstructured / semi-structured data, and complex database operations. We then review existing systems and research prototypes. We first review the functionality of UDBMSs including the footprint size, support of standard SQL, supported data types, transactions, concurrency control, indexing, and recovery. We then review the supportability of requirements by those UDBMSs surveyed. We highlight ubiquitous features of a family of Odysseus systems that have been under development at KAIST for over 19 years. Functionalities of Odysseus can be “devicetized” or customized depending on the device types and applications as in Odysseus / Mobile for small devices, Odysseus / XML for unstructured / semistructured data, Odysseus / GIS for map data, and Odysseus / IR for large-scale search engines. We finally present research topics that are related to the UDBMSs.
The first of the above keynotes was particularly interesting to listen to because of the nature of the data underlying the social computing challenges. Indeed, compared to my previous blog post, we face here one more example of Log Analytics, now related to Web 2.0 applications. Some people may actually call it Web Analytics or Social Analytics, given some specific tasks like, for example (this is my favorite one!), Social Marketing that aims to target the best-connected nodes (most-influential individuals) within the social network. Surprisingly (or not at all), quite analogous tasks can be formulated for other types of complex network-related data sets, for example, the data related to the growth of infectious diseases (see another keynote by Peter M.A. Sloot). As a result, one of the future directions may be the abstraction of complex network data models and the most typical queries against them. However, regardless of whether we use classical relational data models or something more specific for network-related data, query complexity will still be there. Actually, I see most of the tasks of Social Computing (and Social Marketing in particular) as corresponding to quite mixed, ad hoc, complex query workloads.
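As a small illustration of the "best-connected nodes" task, here is a Python sketch that ranks people by their number of connections. Please treat it as a sketch under my own assumptions: the edge list, the names and the function `best_connected` are hypothetical, and plain degree is only the simplest proxy for influence, not the measure used in the keynote. Note that it boils down to a group-by-and-count over an edge table, which is exactly the kind of ad hoc query mentioned above.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def best_connected(edges: Iterable[Tuple[str, str]], k: int = 1) -> List[Tuple[str, int]]:
    """Return the k nodes with the highest degree in an undirected network."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Highest degree first; ties are broken by name to keep the result deterministic.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical "who is connected to whom" data, e.g. extracted from social logs.
edges = [
    ("ann", "bob"),
    ("ann", "cid"),
    ("ann", "dee"),
    ("bob", "cid"),
    ("dee", "eve"),
]
top = best_connected(edges, k=2)
print(top)
```

With real data the edge table would be huge and the notion of influence much richer, which is where the query complexity comes from.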
The second of the above keynotes corresponds to the question of to what degree database engines should be tunable with respect to available resources. Look at all those new storage devices and their specific performance bottlenecks. Is it possible to create a database architecture that can efficiently work with (adapt to) multiple heterogeneous resources at the same time? Is it possible to create an architecture consisting of roughly two layers: one responsible for truly abstracted execution of database operations and another one hiding away the mechanisms of adaptation to particular types of resources? It’s surely something more than the traditional meaning of being “oblivious” (e.g. cache oblivious). I hope it’s doable. But it looks like just the first part of the UDBMS story: gathering all information sources and computational resources to retrieve useful knowledge. The second part is what kind of knowledge is worth retrieving, given potentially limited data access and incompleteness. As an example, one of my colleagues in Italy has conducted very interesting research on approximate reporting on handheld devices. Of course at least some of you already know that approximate querying is one of my favorite topics but I’m trying to keep myself unbiased here. I honestly believe that it is something natural within the UDBMS framework.
Last but not least, the conference is not only about keynotes but also about regular paper presentations, so we did our best to contribute and we submitted two papers. The first paper, co-authored by Marcin Kowalski, is about a data organization that improves the quality of our Knowledge Grid. I blogged about it several times before, so let me briefly refer to the paper and to Infobright’s Wikipedia page, where you can find some relevant links. The second paper, co-authored by Hiroshi Sakai, is a perfect example of how forum discussions can evolve into scientific cooperation. The paper is about SQL-based mining of non-deterministic data. It’s just preliminary work but I hope you will like it.
To conclude, I hope that I have convinced you to look more carefully at the idea of Hybrid IT and the place of databases within it. I’ll be looking forward to your comments and… do not forget about FGIT 2010!
Best greetings,
Dominik
Hello,
I should start with a short announcement: Malaysia is a beautiful country and Melaka is a beautiful city. I have a friend who moved from Poland to Malaysia. He keeps repeating: Dominik, do not tell anyone how nice Malaysia is, because then everybody would want to move here. Well, I hope he is not going to read this post… Anyways, attending the SoCPaR 2009 conference at UTeM was a wonderful experience and I wish to thank the organizers one more time for their great work.
The city of Melaka has been influenced by many nations and cultures. In the same way, SoCPaR is a mixture of diverse aspects of soft computing and pattern recognition. I enjoyed that mixture a lot, listening to invited talks, participating in the panel, and explaining how our technology relates to the conference topics. Out of many presentations, I remember particularly well the one on Collaborative Security Mechanism in Detecting Intrusion Activity. This is because Intrusion Detection is based on Log Analytics – one of our favorite topics at Infobright.
Cyber security-related analysis of logs already has a well-established foundation with respect to data modeling, processing, and mining. In one of my favorite introductory books on this topic, you can find nicely outlined relational schemas related to the log structures and the types of attacks. However, as also discussed in the above-mentioned SoCPaR talk, there may be several types of logs, each of them with a different corresponding data model. Detection of some of these attacks requires looking at all these types of logs in a synchronized way, usually by comparing timestamps. Furthermore, there are two classes of attacks. Fast attacks are detectable within a very short time interval and they are often addressed by the tools of data stream analytics. Slow attacks, on the other hand, correspond to patterns that can develop over a couple of days, so you need to query quite a lot of data (it grows really fast), which is a challenge even if you know exactly what you are looking for (please refer to the differences between specification-based and anomaly-based aspects of intrusion detection).
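To show what "looking at several types of logs in a synchronized way" may look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two log schemas, the field names, the three-day window and the rule itself (at least four suspicious events from one source, seen in more than one log type). It is not a method from the SoCPaR talk, just one simple specification-style check.

```python
import heapq
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Two hypothetical log types with different data models, each sorted by time.
firewall_log = [
    {"ts": "2009-12-01T02:00:00", "src_ip": "10.0.0.7", "action": "DENY"},
    {"ts": "2009-12-02T03:10:00", "src_ip": "10.0.0.7", "action": "DENY"},
    {"ts": "2009-12-02T09:00:00", "src_ip": "10.0.0.9", "action": "DENY"},
]
auth_log = [
    {"time": "2009-12-01T14:30:00", "client": "10.0.0.7", "result": "FAIL"},
    {"time": "2009-12-03T01:45:00", "client": "10.0.0.7", "result": "FAIL"},
    {"time": "2009-12-03T08:00:00", "client": "10.0.0.9", "result": "OK"},
]


def normalize_firewall(rows):
    """Map firewall rows to a common (timestamp, source, log type) shape."""
    for r in rows:
        if r["action"] == "DENY":
            yield (datetime.fromisoformat(r["ts"]), r["src_ip"], "firewall")


def normalize_auth(rows):
    """Map authentication rows to the same common shape."""
    for r in rows:
        if r["result"] == "FAIL":
            yield (datetime.fromisoformat(r["time"]), r["client"], "auth")


def slow_attack_suspects(streams, window=timedelta(days=3), min_events=4):
    """Flag sources with many suspicious events spread over several log types."""
    recent = defaultdict(deque)   # source -> events still inside the window
    suspects = set()
    # heapq.merge interleaves the already sorted streams by timestamp.
    for ts, source, kind in heapq.merge(*streams):
        events = recent[source]
        events.append((ts, kind))
        while ts - events[0][0] > window:
            events.popleft()      # forget events older than the window
        kinds = {k for _, k in events}
        if len(events) >= min_events and len(kinds) > 1:
            suspects.add(source)
    return suspects


suspects = slow_attack_suspects(
    [normalize_firewall(firewall_log), normalize_auth(auth_log)]
)
print(suspects)
```

Even in this toy version, the state kept per source grows with the length of the window, which hints at why slow attacks require touching so much data.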
Of course slow attacks need to be detected as quickly as possible. So, as a summary, there are massive, rapidly growing amounts of log data, to be queried on a nearly real-time basis, multi-table (but nothing like a star schema if you consider several types of logs), in a mixed, canned / ad-hoc way. Wow! Maybe this is why people claim that slow attacks are harder to detect efficiently than the fast ones. :)
Best greetings,
Dominik