According to Wikipedia, data mining is the process of sorting through large amounts of data and picking out relevant information. It can also be understood as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data or, in other words, as the science of extracting useful information from large data sets or databases (see Wikipedia for some data mining books). These are just a few flavors of the definition.
The question remains: Who is conducting the data mining process?
Last week, I participated in a sales call. Asked which types of queries we handle best compared to others, I talked about complex analytics involving multiple joins and aggregations, combined with ad hoc, unpredictable constraints. Then came a question: So you are supporting data mining!? Well, I didn’t answer immediately. However, I recalled that data mining can be interpreted as a process conducted by experts working with a data warehouse designed for ad hoc analytics. The experts are allowed to generate the craziest possible exploratory queries expressing their hypotheses about patterns hidden in the data. Our job is then to execute such queries satisfactorily fast. Hence, yes, we are supporting data mining in this sense.
On the other hand, given my academic background prior to Infobright, I got used to a slightly different definition of data mining. Basically, it’s the same kind of process as the one described above. The difference is that instead of a team of human experts there is an artificial entity, a kind of intelligent algorithm that generates the hypotheses by itself, verifies them against the data, and reports only the best findings to the end user (who should, of course, still be able to interact and to formulate their own thoughts). The comparison of those two interpretations gets even more interesting when we look at some of the current trends in data mining research. One of them is to rewrite the classical data mining methods so that they work with automatically generated SQL aggregations instead of with the plain data directly (which becomes pretty tough for really large data volumes). Actually, when speaking with my academic friends, this is one of my suggestions on how to use ICE. Our job is then to support such processes in exactly the same sense as discussed above!
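To make the idea of mining through automatically generated SQL aggregations a bit more concrete, here is a minimal sketch. It is only an illustration under my own assumptions, not a method taken from any particular paper or from ICE itself: the table `customers`, its columns, and the scoring rule (the spread of the target's average across groups) are all hypothetical, and SQLite stands in for the warehouse. The algorithm generates one GROUP BY query per candidate column, lets the database do the aggregation, and reports the columns ranked by how strongly they separate the target.

```python
import sqlite3


def rank_columns(conn, table, candidates, target):
    """Score each candidate column by the spread of AVG(target) across its groups.

    One aggregation query is generated per hypothesis; the raw rows never
    leave the database. Identifiers are interpolated directly, so this
    sketch assumes they come from a trusted list.
    """
    scores = []
    for col in candidates:
        rows = conn.execute(
            f"SELECT {col}, AVG({target}), COUNT(*) FROM {table} GROUP BY {col}"
        ).fetchall()
        means = [mean for _, mean, _ in rows]
        # Hypothetical scoring rule: larger spread = more interesting pattern.
        scores.append((max(means) - min(means), col))
    return sorted(scores, reverse=True)


def demo():
    # Hypothetical toy data: does region or channel explain churn better?
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (region TEXT, channel TEXT, churned INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            ("north", "web", 1), ("north", "shop", 1),
            ("north", "web", 1), ("north", "shop", 0),
            ("south", "web", 0), ("south", "shop", 0),
            ("south", "web", 0), ("south", "shop", 1),
        ],
    )
    return rank_columns(conn, "customers", ["region", "channel"], "churned")


if __name__ == "__main__":
    for score, column in demo():
        print(f"{column}: spread of churn rate = {score:.2f}")
```

On this toy data, `region` comes out first (churn rate 0.75 versus 0.25) and `channel` last (0.50 in both groups). A real method would of course use a proper statistical measure and many more hypotheses; the point is only that every step is an aggregation query the database engine has to answer quickly.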
Well, there is still one more question: What is the goal of data mining?
In both of the above scenarios, the goal is the same: given the data records corresponding to some subjects of analysis (business, scientific, etc., depending on the application area), we want to extract the most practically meaningful, most helpful knowledge. However, this is not the only possibility. When reading recent scientific papers and patent applications, I can see a growing interest in automating physical data model tuning. In my opinion, this is a case of data mining too, with a modified form of input data, expected results, and overall objectives. The input to the data mining process here is not only the data but also, for example, samples of the query workloads that we can expect. The expected results are now formulated in terms of the best possible settings of the physical data model parameters. So, generally, this brings us to the third face of data mining in relation to databases and warehousing: conducting (automatically or semi-automatically) data mining processes in order to optimize the performance of a given database engine.
This third face of data mining is, in particular, clearly visible in Infobright’s technology. The basic idea is to automatically compute meaningful knowledge creatures (more precisely, we call them knowledge nodes) that represent the data. Each such creature is evaluated by means of three parameters: its usefulness in query optimization and execution, its size (the total size of all maintained creatures should not exceed 1% of the size of the compressed data), and the ease of updating it when the data changes. The goal of the data mining process hidden inside the Infobright engine is to find the most useful creatures under the constraints on size and ease of update. I must say that in my own research, looking for ways to adapt classical data mining methods in order to enrich the variety of helpful knowledge creatures is one of the most exciting tasks! I hope that others will find it equally interesting and that we will be able to work together to extend Infobright’s architecture in this respect.
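The trade-off between the three parameters can be sketched as a small selection problem. The code below is purely illustrative and is not how the Infobright engine actually works: the candidate names, the numeric usefulness and update-cost estimates, the update-cost threshold, and the greedy usefulness-per-byte rule are all assumptions of mine. Only the 1% size budget comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical label for a candidate knowledge node
    usefulness: float    # assumed estimate of benefit for query execution
    size_bytes: int      # storage needed to maintain the node
    update_cost: float   # assumed relative cost of keeping it current


def select_nodes(candidates, compressed_bytes,
                 budget_fraction=0.01, max_update_cost=1.0):
    """Greedily pick candidates by usefulness per byte within a size budget.

    budget_fraction=0.01 mirrors the "1% of compressed data" constraint;
    max_update_cost is a hypothetical threshold for "easy enough to update".
    """
    budget = compressed_bytes * budget_fraction
    eligible = [c for c in candidates if c.update_cost <= max_update_cost]
    eligible.sort(key=lambda c: c.usefulness / c.size_bytes, reverse=True)

    chosen, used = [], 0
    for c in eligible:
        if used + c.size_bytes <= budget:
            chosen.append(c)
            used += c.size_bytes
    return chosen


def demo():
    # Hypothetical candidates for 1,000,000 bytes of compressed data,
    # which gives a budget of 10,000 bytes.
    candidates = [
        Candidate("node_a", usefulness=8.0, size_bytes=4000, update_cost=0.2),
        Candidate("node_b", usefulness=5.0, size_bytes=5000, update_cost=0.3),
        Candidate("node_c", usefulness=7.2, size_bytes=9000, update_cost=0.5),
        Candidate("node_d", usefulness=6.0, size_bytes=2000, update_cost=2.0),
    ]
    return [c.name for c in select_nodes(candidates, 1_000_000)]


if __name__ == "__main__":
    print(demo())
```

Here `node_d` is rejected as too costly to update, `node_a` and `node_b` fit into the budget, and `node_c` no longer does. A greedy rule like this is not guaranteed to be optimal; it is just the simplest way to show how usefulness, size, and ease of update pull against each other.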
Best greetings,
Dominik
Hello David,
Good introductory book indeed. I like Chapter 3, “data mining and the data warehouse”.
This is one of many publications illustrating that my post should be the beginning rather than the end of the discussion. There are certainly more faces of data mining when considered in the context of databases and data warehouses. For example, let’s have a look at the section about “learning as compression of data sets” (page 111). I’m surely going to follow up on this one…
Best greetings and thanks again,
Dominik
Dominik,
One of the best, and shortest, texts I’ve read on data mining is titled simply ‘Data Mining’ by Pieter Adriaans and Dolf Zantinge. It was published in 1996 before many of the methods and objectives of data mining had been coded into commercial (non-governmental, non-military) software.