Hello,
One of my academic colleagues is interested in generating association rules from non-deterministic information systems, i.e., roughly speaking, data tables where values of rows for some attributes may take a form of sets instead of single elements.
It certainly motivated me to revisit my feelings about SQL-based computation of frequent itemsets. Below I’m still considering deterministic case because, otherwise, it would get too complicated too quickly. However, you’ll hear about non-deterministic information systems again pretty soon.
First of all, I’m not talking any more about the classical basket analysis, as in the previous post in this thread. We are now interested in building itemsets (patterns, templates) using arbitrary frequently occurring attribute-value combinations. Accordingly, we need to change a way of representing data. Please refer to the attachment with this respect.
The query discussed in the previous post can be now rewritten as follows:
insert into F_k
select attr_1, val_1,…, attr_k, val_k, count(*) as supp
from C_k, T t_1,…,T t_k
where t_1.attr = C_k.attr_1 and
t_1.val = C_k.val_1 and
…
t_k.attr = C_k.attr_k and
t_k.val = C_k.val_k and
t_1.rid = t_2.rid and
...
t_k-1.rid = t_k.rid
group by attr_1, val_1,…, attr_k, val_k
having supp >= minsup;
wherein:
T is the analyzed data (see the attachment);
rid – row identifier;
attr – attribute identifier;
val – value identifier;
C_k – table with candidate k-itemsets;
F_k – table with frequent k-itemsets;
attr_1, val_1,…, attr_k, val_k – columns in C_k and F_k;
minsup – minimum itemset support.
Let’s actually start with something basic and generate F_1. Let’s call it simply F:
insert into F
select attr, val, count(*) as supp
from T
group by attr, val
having supp >= minsup;
Now, in order to compute F_2 according to the above scenario, we would need to prepare C_2 first. It looks simple. It’s about putting (attr,val) pairs in F into quadruples (attr_1,val_1,attr_2,val_2). We just need to avoid the cases of attr_1=attr_2 and repetitions (attr_2,val_2,attr_1,val_1).
On the other hand, wouldn’t it be nice to have a query that doesn’t require such an intermediate step related to C_k? Please check out the following version and let me know what you think. We’re going to use the above-derived F (F_1) to calculate F_k directly from F_k-1. Also note that additional condition F_k-1.attr_k-1 < F.attr enables us to avoid unwanted repetitions and invalid itemsets. On the other hand, the result set is complete and nicely presented with respect to attributes used.
insert into F_k
select F_k-1.attr_1, F_k-1.val_1,…, F_k-1.attr_k-1, F_k-1.val_k-1, F.attr, F.val,
count(*) as supp
from F_k-1, F, T t_1,…,T t_k
where F_k-1.attr_k-1 < F.attr and
t_1.attr = F_k-1.attr_1 and
t_1.val = F_k-1.val_1 and
…
t_k-1.attr = F_k-1.attr_k-1 and
t_k-1.val = F_k-1.val_k-1 and
t_k.attr = F.attr and
t_k.val = F.val and
t_1.rid = t_2.rid and
...
t_k-1.rid = t_k.rid
group by F_k-1.attr_1, F_k-1.val_1,…, F_k-1.attr_k-1, F_k-1.val_k-1, F.attr, F.val
having supp >= minsup;
Best greetings and looking forward to comments,
Dominik