Joinutility seperatorLogin utility separator Infobright.com
   
 
Data Mining in Warehousing
Posted: 07 November 2008 07:27 AM   Ignore ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

I tried to summarize possible ways of thinking about data mining in the context of data warehousing:

http://www.infobright.org/Open-Source/Blog/Academic

It’s certainly just a starting point for discussion. My observations are rough rather than exact.

I’ll be grateful for the comments on this topic… On my side, I’ll follow up with some examples soon…

Best greetings,

Dominik

Signature 
Profile
 
Posted: 26 November 2008 08:46 PM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

So maybe let’s start with something widely known…

How about calculating frequent itemsets using SQL? Consider the famous Apriori algorithm:

http://en.wikipedia.org/wiki/Apriori_algorithm

On the other hand, consider a piece of SQL that does the same, attached here as a screenshot from one of roughly a decade old articles.

(When you compare the wiki page with the attachment, just don’t bother with little differences in notation, like L and F or epsilon and minsup, treat items as the attributes and tid as an identifier of the rows in the data table T you are calculating the itemsets from.)

Sounds cool and easy… However…

1. I can imagine that research in the area of SQL support for fast calculation of frequent itemsets and association rules has evolved significantly since publication of that article. So, what is roughly a state of the art now?

2. I can imagine that such SQL queries are quite hard to execute over truly large amounts of data and that sampling is not always a solution. So, which database engines seem to provide the best performance for such queries?

I must admit this specific topic is pretty new to me. Hence, I’ll be very grateful for comments!!

Best greetings,

Dominik

[ Edited: 27 November 2008 04:05 PM by Dominik Slezak]
Image Attachments  SQL ITEMSETS.JPG
Signature 
Profile
 
Posted: 30 March 2009 02:05 AM   Ignore ]   [ # 2 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello,

One of my academic colleagues is interested in generating association rules from non-deterministic information systems, i.e., roughly speaking, data tables where values of rows for some attributes may take a form of sets instead of single elements.

It certainly motivated me to revisit my feelings about SQL-based computation of frequent itemsets. Below I’m still considering deterministic case because, otherwise, it would get too complicated too quickly. However, you’ll hear about non-deterministic information systems again pretty soon.

First of all, I’m not talking any more about the classical basket analysis, as in the previous post in this thread. We are now interested in building itemsets (patterns, templates) using arbitrary frequently occurring attribute-value combinations. Accordingly, we need to change a way of representing data. Please refer to the attachment with this respect.

The query discussed in the previous post can be now rewritten as follows:

insert into F_k 
   select      attr_1
val_1,attr_kval_kcount(*) as supp
   from        C_k
T t_1,,T t_k 
   where       t_1
.attr C_k.attr_1 and 
                   
t_1.val C_k.val_1 and
                   
… 
                   t_k
.attr C_k.attr_k and
                   
t_k.val C_k.val_k and
                   
t_1.rid t_2.rid and 
                   ... 
                   
t_k-1.rid t_k.rid 
   group by    attr_1
val_1,attr_kval_k
   having      supp 
>= minsup

wherein:
T is the analyzed data (see the attachment);
rid – row identifier;
attr – attribute identifier;
val – value identifier;
C_k – table with candidate k-itemsets;
F_k – table with frequent k-itemsets;
attr_1, val_1,…, attr_k, val_k – columns in C_k and F_k;
minsup – minimum itemset support.

Let’s actually start with something basic and generate F_1. Let’s call it simply F:

insert into F
   select      attr
valcount(*) as supp
   from        T 
   group by    attr
val
   having      supp 
>= minsup

Now, in order to compute F_2 according to the above scenario, we would need to prepare C_2 first. It looks simple. It’s about putting (attr,val) pairs in F into quadruples (attr_1,val_1,attr_2,val_2). We just need to avoid the cases of attr_1=attr_2 and repetitions (attr_2,val_2,attr_1,val_1).

On the other hand, wouldn’t it be nice to have a query that doesn’t require such an intermediate step related to C_k? Please check out the following version and let me know what you think. We’re going to use the above-derived F (F_1) to calculate F_k directly from F_k-1. Also note that additional condition F_k-1.attr_k-1 < F.attr enables us to avoid unwanted repetitions and invalid itemsets. On the other hand, the result set is complete and nicely presented with respect to attributes used.

insert into F_k 
   select      F_k
-1.attr_1F_k-1.val_1,F_k-1.attr_k-1F_k-1.val_k-1F.attrF.val,
                   
count(*) as supp 
   from        F_k
-1FT t_1,,T t_k 
   where       F_k
-1.attr_k-F.attr and
                   
t_1.attr F_k-1.attr_1 and 
                   
t_1.val F_k-1.val_1 and
                   
… 
                   t_k
-1.attr F_k-1.attr_k-and
                   
t_k-1.val F_k-1.val_k-and
                   
t_k.attr F.attr and
                   
t_k.val F.val and
                   
t_1.rid t_2.rid and 
                   ... 
                   
t_k-1.rid t_k.rid 
   group by    F_k
-1.attr_1F_k-1.val_1,F_k-1.attr_k-1F_k-1.val_k-1F.attrF.val
   having      supp 
>= minsup

Best greetings and looking forward to comments,

Dominik

[ Edited: 31 March 2009 05:10 AM by Dominik Slezak]
File Attachments 
data example.pdf  (File Size: 27KB - Downloads: 310)
Signature 
Profile
 
Posted: 31 March 2009 05:19 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello Again,

I just can’t stop thinking about the power of such SQL-based approaches. So let me move a bit further.

Once frequent itemsets that satisfy a predefined minimum support threshold minsup are generated, the next step in classical approach is to use them to calculate IF-THEN rules that additionally satisfy the minimum accuracy threshold minacc. Let me focus on a special case of such rules with the fixed decision attribute. Extraction of such decision rules from data is a popular task in machine learning and also in rough set approaches to classification, prediction, etc.

Consider the attached data table. (The original table is the same as in the previous post.) Imagine that decision attribute is the last one, so possible decisions are Yes and No. An example of decision rule with support equal to 3/14 and accuracy equal to 2/3 is as follows:

IF Temp. = Hot AND Humid. = High THEN Sport? = No

For various reasons (clarity of description, applicability to future cases) we would like to keep rules as short as possible. Certainly, getting rid of Temp. = Hot or Humid. = High would simplify the above rule. However, accuracy may behave quite unpredictably. On the other hand, if the accuracy of 2/3 is not satisfactory, we may try to add something to Temp. = Hot AND Humid. = High. However, it may decrease the support too much. All together, the task of generating irreducible (with respect to their left sides) decision rules that satisfy predefined support and accuracy thresholds is quite a cool task.

(One may also consider coverage instead of support. It makes sense for data with imbalanced decisions. No problem with appropriate modification of SQL framework presented below.)

Let’s go back to the attachment again. Notice that table T does not contain information about attribute Sport? any longer. Instead, decisions Yes and No are represented in a separate table C. Certainly, it’s not the smartest possible way of rearranging the data. However, it’s good enough for illustration purposes.

We are going to calculate rules for each of decisions separately. Consider the following query for Yes:

insert into D_k 
   select      F_k
-1.attr_1F_k-1.val_1,F_k-1.attr_k-1F_k-1.val_k-1F.attrF.val,
                   
count(*) as suppcount(D.Yes)/count(*) as acc
   from        F_k
-1FDT t_1,,T t_k 
   where       F_k
-1.attr_k-F.attr and
                   
t_1.attr F_k-1.attr_1 and 
                   
t_1.val F_k-1.val_1 and
                   
… 
                   t_k
-1.attr F_k-1.attr_k-and
                   
t_k-1.val F_k-1.val_k-and
                   
t_k.attr F.attr and
                   
t_k.val F.val and
                   
t_1.rid t_2.rid and 
                   ... 
                   
t_k-1.rid t_k.rid and
                   
t_k.rid D.rid
   group by    F_k
-1.attr_1F_k-1.val_1,F_k-1.attr_k-1F_k-1.val_k-1F.attrF.val
   having      supp 
>= minsup

At each k-th step of decision rule generation, table D_k should be further used as follows:

1. Every row that corresponds to a rule such that acc >= minacc should be sent to FINAL RESULT

2. Every row that corresponds to a rule such that acc < minacc should be sent to table F_k for the next iteration

The whole process stops when there is nothing left to be sent to F_k. FINAL RESULT will then contain all irreducible decision rules (for decision Yes) such that supp >=minsup and acc >= minacc.

Isn’t it beautiful?

Best greetings,

Dominik

[ Edited: 31 March 2009 05:22 AM by Dominik Slezak]
File Attachments 
data example revised.pdf  (File Size: 28KB - Downloads: 292)
Signature 
Profile
 
Posted: 05 April 2009 12:27 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-01-29

Dominik,

From the perspective of mining customer transactions, the problem identifying frequent items sets across market baskets is challenging. The difficulty of the problem usually is reduced through pre-selection of the products and restriction to purchase pairs. These substantial simplifications are possible because marketers usually are focused on categories and brands in which they compete and, most importantly historically have relied on simplistic ranking the paired items - whether categories and brands - according to the averages of basket incidence or % of contribution.

Interestingly, some categories tend to exhibit multiple different product purchases within the same brand and the same basket - e.g. multiple flavors -while other do not. Extending mining of market basket from purchase pairs to frequent item sets of interest offers the potential to enrich the available information value to marketers to support decisions on SKU promotion and rationalization.

A closely related problem to market baskets is analyzing cross-purchasing over time. While quite similar mathematically, the cross-purchasing relaxes the basket constrain and analyzes similar measures averaged across all baskets within the desired time period.

Profile
 
Posted: 06 April 2009 06:54 AM   Ignore ]   [ # 5 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello BInnovator,

Thanks a lot for comments!

Indeed, this is also how I look at applicaction of intelligent algorithms in data mining. On the one hand, perhaps 90% (or even 95%) of data mining tasks can be supported by the expert-database dialog. On the other hand, there are some specific tasks where additional (often AI-based) algorithmic support is needed. Basket or, e.g., cross-purchasing analysis are good examples. It is not so often the case that analysis of pairs of products etc. is not sufficient to draw business conclusions. Sometimes, however, there are no significant pair-based relationships in data. Sometimes information is hidden in more complex dependencies. Then it is unfair to ask the experts to analyze exponentially growing number of potentially interesting patterns. Then some heuristics searching through data are welcome to assist the experts with providing subsets of potentially most valuable hypotheses.

From more technical perspective, it is interesting to observe that such additional algorithms have been usually based on data sampling. However, samples are often hard to maintain as statistically representative. Moreover, for many real-life problems, the currently existing algorithms cannot work even with samples unless they include unacceptably low percentage of data. This is why people often try to translate classical implementations of data mining algorithms to let them work more deeply with SQL. It’s not always possible. However, I hope we can work it out in some cases.

Best greetings and thanks again,

Dominik

[ Edited: 06 April 2009 06:57 AM by Dominik Slezak]
Signature 
Profile
 
Posted: 06 April 2009 12:42 PM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  8
Joined  2009-01-29

Dominik,

For data mining I believe that SQL algorithms certainly complement other non-SQL algorithms (utilize SQL-sourced data as input).  Regardless, all compute-intensive data mining algorithms are constained by space available for high-speed processing in-memory.  Hence, I agree that data reduction techniques, including sampling or fingerprinting, are necessary utilize most common data mining algorithms.

Since many of the interesting findings come from analyzing longitudinal behavior of individual customers or portfolios comprising segments of them, the choice of data reduction techniques becomes one of selecting a representative set of customers across one or more of these segments.  Of course, for that purpose statistical sampling - whether random or frequency- is the commonly employed technique.

The statistical sampling approach affords two important benefits: (1) well-understood and widely-accepted model of reality- business managers prefer and place more trust in models that they understand and are usually comfortable making strategic decisions using 5-10% samples; and (2) easily tunable via parameters to address diverging business needs - addresses the spectrum of potential data reduction requirements.  Since customers usually are classified based on customer histories into matrices of segmentations - defined by multiple criteria, each bucket typically comprises only 2-4% of customers.  To illustrate, 1,000,000 customers would average 20-40K per bucket and 5-10% of those would range from 1K-4K. 

Regarding IB differentiation, the 65K rows of DPs in KN enables a key optimization of sampling customers across time because the scope of each customer’s transaction set is likely much less than a single data pack, and hashed Customer_ID values that are distributed evenly across all customers as with standard approaches to datawarehousing. We could take advantage of this observation even without querying from the KG.

For example, given the fact table’s transactions are ordered first by Customer_ID and then by datetime, sampling customers by selecting hashed Customer_ID values from Start value to Start value + N offset, where the Start value is selected randomly iterated and the N offset is appropriately elected multiplier.  If customers average 10 products per transaction and make 100 transactions per year the DP’s would comprise approx. 6,554 transactions for 65 customers.  Thus, various alternatives of these paremeters could yield a sample of 1000 customers, such as:

A. (Start value, N offset): 1000, 0
B. (Start value, N offset): 100, 10
C. (Start value, N offset): 10, 100

Although these options result in similar sample size, progressing from options A to C samples from increasingly more concentrated DPs. Alternative A requires as many as 1000 DPs while alternative C requires as few as 10. Similarly, for DPs of each fact column - dollars, units, etc.  Ultimately, attaining the optimal I/O efficiency requires organization of the data set such that transactions of samples keys are stored close together. Otherwise, I/O cost could be expected to grow proportionately with the desired sample size. 

Although one approach is pre-sampling into separate fact table, this implies less flexibility in available filtering criteria and shifts the management of filtered subsets to the analytic application.  However, the database offers significant advantages for managing the sampling and filtering requirements. IB would exhibit maximal I/O performance by minimizing the number of DPs involved in sampling, and leveraging the unique statistics of KN likely would accelerate and enrich the potential information value gained from data mining.

Profile
 
Posted: 07 April 2009 06:06 AM   Ignore ]   [ # 7 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello BInnovator:

It’s always a great pleasure to discuss with you!

BInnovator - 06 April 2009 12:42 PM

For data mining I believe that SQL algorithms certainly complement other non-SQL algorithms (utilize SQL-sourced data as input). Regardless, all compute-intensive data mining algorithms are constained by space available for high-speed processing in-memory.  Hence, I agree that data reduction techniques, including sampling or fingerprinting, are necessary utilize most common data mining algorithms.

The same data mining tasks can be addressed by different algorithms. Depending on the choice of algorithm, one can base on (samples of) data in memory, on the results of automatically generated SQL statements, or on both. Data samples can be even generated dynamically during the algorithm’s execution. I heard about such hybrid algorithms that start with a broad search based on SQL but then switch to detailed analysis of samples of data subspaces that are identified as the most interesting ones. 

Consider the task of generating decision rules, used as a case study in this thread. One can use the algorithm based on SQL (I still need to refine it) or widely known algorithms based on data in memory. The results may differ if we operate on samples in the latter case. They may differ even more if we are not necessarily interested in all rules but rather attempt to search for those most interesting ones. The in-memory-sample-based and SQL-based heuristic implementations may then provide us with different sub-optimal solutions. The speed of both types of algorithms may differ too. Everything depends on the size of (samples of) data and the performance characteristics of the database platform that we use for SQL-based analysis. (I believe ICE will be a good one!)

Nevertheless, I fully agree that the culture of sampling is crucial becacuse many of the advanced data mining tasks need to be solved by algorithms based on data sample analysis, at least to some extent. I highly appreciate your opinion, supported by very valuable examples, that ICE may be a good framework for efficient sampling both with respect to speed and quality. As we discussed in the approximate querying thread, it should be possible to implement SQL-based intelligent data sampling in ICE in future. (Hopefully sooner than later.) Once we do it, ICE will provide a good background for both SQL-based and sample-based data mining algorithms!

Many thanks and best greetings,

Dominik

[ Edited: 07 April 2009 06:09 AM by Dominik Slezak]
Signature 
Profile
 
Posted: 07 April 2009 03:30 PM   Ignore ]   [ # 8 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Some people say that “the better is the enemy of the good” but I was still not satisfied with solution proposed in reply #3. It works but we needed to store data in two structures and run the algorithm for each decision value separately. The following strategy needs a single data structure T (like in reply #2) and it enables to deal with all decision values in the same time. It’s difficult to say whether it’s going to run faster than the procedure in reply #3. It would be interesting to double check it on various platforms.

So let’s find all the irreducible (with their left sides impossible to shorten) decision rules with pre-defined constraints on support (minsup) and accuracy (minacc). We work with table T as in the attachment to reply #2 (not #3). We start with producing the table C of conditions. Surely, we also need to set up the decision attribute DEC that we want to describe using the others.

insert into C 
   select      attr
valcount(*) as supp
   from        T
   where       attr 
!= DEC 
   group by    attr
val
   having      supp 
>= minsup

Now, let’s create table D_1 which stores the candidate decision rules with single conditions:

insert into D_1 
   select      C
.attrC.valt_dec.val as decval,
                  
count(*) as suppcount(*)/min(C.supp) as acc
   from        C
T t_1T t_dec 
   where       t_1
.attr C.attr and
                   
t_1.val C.val and
                   
t_1.rid t_dec.rid
   group by    C
.attrC.valt_dec.val
   having      supp 
>= minsup

Everything seems to be intuitive except min(C.supp). However, the value of C.supp is uniquely defined by C.attr and C.val used in group by. We use min(C.supp) instead of C.supp just to make the SQL syntax correct. On the other hand, I guess that we could write C.supp instead of min(C.supp) and it would still somehow work for some platforms.

Table D_1 is almost identical to D_1 in reply #3. The only difference is that we have now additional column decval indicating the decision values for particular rules. The way of splitting the contents of D_1 between F_1 and FINAL RESULT with respect to the condition acc>=minacc is the same as before. However, the way of following with generation of D_2 is different. We first need to construct table C_2:

insert into C_2 
   select      c_1
.attr as attr_1c_1.val as val_1c_2.attr as attr_2c_2.val as val_2,
                   
count(*) as supp
   from        C c_1
C c_2T t_1T t_2
   where       attr_1 
attr_2 and
                   
t_1.attr c_1.attr and
                   
t_1.val c_1.val and
                   
t_2.attr c_2.attr and
                   
t_2.val c_2.val and
                   
t_1.rid t_2.rid
   group by    c_1
.attrc_1.valc_2.attrc_2.val
   having      supp 
>= minsup

And finally:

insert into D_2 
   select      C_2
.attr_1C_2.val_1C_2.attr_2C_2.val_2t_dec.val as decval,
                   
count(*) as suppcount(*)/min(C_2.supp) as acc
   from        C_2
F_1 f_1F_1 f_2T t_1T t_2T t_dec 
   where       C_2
.attr_1 f_1.attr and
                   
C_2.val_1 f_1.val and
                   
C_2.attr_2 f_2.attr and
                   
C_2.val_2 f_2.val and
                   
t_1.attr C_2.attr_1 and
                   
t_1.val C_2.val_1 and
                   
t_2.attr C_2.attr_2 and
                   
t_2.val C_2.val_2 and
                   
t_dec.val f_1.decval and
                   
t_dec.val f_2.decval and
                   
t_1.rid t_dec.rid and
                   
t_2.rid t_dec.rid
   group by    C_2
.attr_1C_2.val_1C_2.attr_2C_2.val_2t_dec.val
   having      supp 
>= minsup

The algorithm continues with splitting each next D_k-1 between FINAL RESULT and table F_k-1 depending on the values of acc. If F_k-1 is non-empty, we generate C_k. If C_k is non-empty, we generate D_k using F_k-1 and C_k. And so on and so forth…

Best greetings,

Dominik

[ Edited: 08 April 2009 04:18 AM by Dominik Slezak]
Signature 
Profile
 
Posted: 23 June 2009 02:08 AM   Ignore ]   [ # 9 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  453
Joined  2008-08-18

Hello,

Just a short update…

I found a nice link to a more “classical” framework for mining association rules using SQL:

http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/sigmod98_dbi_rj.pdf

Best greetings,

Dominik

Signature 
Profile