frequent pattern analysis
frequent pattern/itemset:在data set中常出現的頻率
ex:a set of item,subsequences,substructures
目地:找出資料固定的規則
應用:basket data analysis,cross-marketing(市場交互情形),catalog design(產品搭配的設計),…等
基本概念
找出x,y的minimum support和confidence
support:x,y同時發生的比率
confidence x=>y: x發生時,y發生的比例
主要步驟
1.find all frequent itemset:
frequently需大於minimum support
2.generate strong association rule from the frequent itemset
需滿足minimum support和minimum confidence
(option)3.correlation analysis for association rule
………………………………………………………………………………………………………………..
1.find all frequent itemset
常見的scalable mining method如下
apriori:簡單的方法,但耗時,因為要比較candidate,而且要一直scan
FPgrowth:計算快速,因為不用產生candidate,而且scan僅需一次
vertical data format approach:將欄與列轉換,方法簡單,比apriori快
…..
apriori
方法致如下
1.scan:計算每個itemset support,得到Cn
2.compare:各itemset(candidate) support和mini support比較,以得到Ln
3.根據Ln產生Cn+1
4.重覆1-3直到結束
apriori做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)
產生l3如下
C1:scan itemset : sup 1 : 6 2 : 7 3 : 6 4 : 2 5 : 2 | L1:大於min sup之candidate itemset : sup 1 : 6 2 : 7 3 : 6 4 : 2 5 : 2 | |
generate C2 itemset 1,2 1,3 1,4 1,5 2,3 2,4 2,5 3,4 3,5 4,5 | C2:scan itemset : sup 1,2 : 4 1,3 : 4 1,4 : 1(小於min support) 1,5 : 2 2,3 : 4 2,4 : 2 2,5 : 2 3,4 : 0(小於min support) 3,5 : 1(小於min support) 4,5 : 0(小於min support) | L2:大於min sup之candidate itemset : sup 1,2 : 4 1,3 : 4 1,5 : 2 2,3 : 4 2,4 : 2 2,5 : 2 |
generate C3 itemset 1,2,3 1,2,5 1,3,5(因3,5:1所以不用) 2,3,5(因3,5:1所以不用) 2,3,4(因3,4:0所以不用) 2,4,5(因4,5:0所以不用) | C3:scan itemset : sup 1,2,3 : 2 1,2,5 : 2 | L3:大於min sup之candidate itemset : sup 1,2,3 : 2 1,2,5 : 2 |
generage C4 itemset 1,2,3,5 | C4:scan itemset : sup 1,2,3,5: 1(小於min support) | L4:大於min sup之candidate itemset : sup |
so
L3 {1,2,5*2} {1,2,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{1,2*4}
…..
FPgrowth
方法大致如下
1,compute support and sort by support
2,refresh by step1 result
3,create tree by step2 result
4,conditional pattern base => frequent patterns generated
FPgrowth做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)
1
compute support(若有小於support則將他排除) and sort by support
itemset : sup
2 : 7
1 : 6
3 : 6
4 : 2
5 : 2
2
refresh by step1 result
a: 2,1,5
b: 2,4
c: 2,3
d: 2,1,4
e: 1,3
f: 2,3
g: 1,3
h: 2,1,3,5
i: 2,1,3
3
create tree by step2 result
null
|_2(7)
| |_1(4) (1 base 2*1)
| | |_5(1) [a=2,1,5] (5 base 2,1*1)
| | |_3(2) [i=2,1,3] (3 base 2,1*2)
| | | |_5(1) [h=2,1,3,5] (5 base 2,1,3*1)
| | |
| | |_4(1) [d=2,1,4] (4 base 2,1*1)
| |
| |_3(2) [c,f=2,3] (3 base 2*2)
| |_4(1) [b=2,4] (4 base 2*1)
|
|_1(2)
|_3(2) [g,e=1,3] (3 base 1*2)
4
conditional pattern base => frequent patterns generated
5 ((2,1*1),(2,1,3*1)) => {2,5*2},{1,5*2},{2,1,5*2} //{3,5*1} 小於mini sup
4 ((2,1*1),(2*1)) => {2,4*2} //{2,1*1} 小於mini sup
3 ((2,1*2),(2*2),(1*2)) => {2,3*4},{1,3*4},{2,1,3*2}
1 (2*4) => {2,1*4}
so
L3 {2,1,5*2} {2,1,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{2,1*4}
…..
vertical data format approach
方法大致如下
1欄列轉換
2將前一步驟未小於min support的值做組合並交集
vertical data format approach做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)
1
欄列轉換
itemset : sup
1:a,d,e,g,h,i
2:a,b,c,d,f,h,i
3:c,e,f,g,h,i
4:b,d
5:a,h
2
將前一步驟未小於min support的值做組合並交集
12:a,h,i
13:e,g,h,i
14:d (小於min support)
15:a,h
23:c,f,h,i
24:b,d
25:a,h
34: (小於min support)
35:h (小於min support)
45: (小於min support)
3
將前一步驟未小於min support的值做組合並交集
1,2,3:h,i
1,2,5:a,h
1,3,5:h (因3,5:0所以不用)
2,3,4: (因3,4:0所以不用)
2,3,5: (因3,5:0所以不用)
2,4,5: (因4,5:0所以不用)
so
L3 {2,1,5*2} {2,1,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{2,1*4}
……………………………………………………………………………………………………………
2.generate strong association rule from the frequent itemset
x=>y(confidence): confidence=x發生時,y發生的比例
做法範例(if confidence threshold is 70%,support as above)
{2,5*2} 2=>5(2/7) ; 5=>2(2/2=100%>70%)
{1,5*2} 1=>5(2/6) ; 5=>1(2/2 =100%>70%)
{2,4*2} 2=>4(2/7) ; 4=>2(2/2=100%>70%)
{2,3*4} 2=>3(4/7) ; 3=>2(4/6)
{1,3*4} 1=>3(4/6) ; 3=>1(4/6)
{2,1*4} 1=>2(4/6) ; 2=>1(4/7)
{2,1,5*2}
1^2=>5(2/4=50%)
1^5=>2(2/2=100%>70%)
2^5=>1(2/2=100%>70%)
1=>2^5(2/6=33%)
2=>1^5(2/7=29%)
5=>1^2(2/2=100%>70%)
{2,1,3*2}
1^2=>3 ,
1^3=>2
2^3=>1 ,
1=>2^3 ,
2=>1^3 ,
3=>1^2 ,
結果可寫成以下
x=>y(support,confidence)
5=>2(40%,100%)
5=>1(40%,100%)
4=>2(40%,100%)
……………………………………..
找出各種association rules
常見有以下
mining multilevel association,
mining multidimensional association
mining quantitave association
mining interesting correlation patterns
mining multilevel association,
各層support計算方式主要分為兩種
uniform support:每一層都用相同的support
reduced support:越下層用越少的support
ex:第一層virus[sup=10%],第二層可分為file virus[sup=4%]和boot virus[sup=6%]
uniform support=5%,則第二層file virus[sup=4%]則不被找到
reduced support,第一層=5%,第二層=3%,即可全部找到
mining multidimensional association
主要分兩種rule
inter-dimension association rules(no repeated) ex:age(20-30)^job(student)=>buy(coke)
hybrid-dimension association rule(repeated) ex:age(20-30)^buy(pizza)=>buy(coke)
mining quantitave association
1將數值先切割成各段
2找出association
ex:age(20)^income(10-20k)=>buy(pc)
…………………………………………………………………………………………………………….
(option)3.correlation analysis for association rule
只靠support及confidence有時不夠,因此有時會加correlation
ex:
假設在10000筆購買記錄中,買computer有6000人,買lcd有7500人,兩者同時買有4000人
且min support=30%,min confidence=60%
假如發現以下association rule
buy(computer)=>buy(lcd)[support=40%,confidence=66%]
買computer的人,會有66%買lcd
但是因為根據購買記錄可以看到,買lcd機率為75%,所以該association rule沒有幫助
常見的correlation measures有
chi-squre
lift
all_conf,較適合用於association rule
coh,較適合用於association rule
假設virus和p2p的contingency table如下
virus | no virus | sum | |
p2p | v,p (100) | !v,p (200 ) | p (300) |
no p2p | v,!p (300) | !v,!p (400) | !p (700) |
sum | v (400) | !v (600) | total (1000) |
lift(A,B)=p(A交集B)/p(A)*p(B)
lift(v,p)=(v,p/total)/((v/total)*(p/total))=(100/1000)/((400/1000)*(300/1000))=0.83
lift(v,!p)=(v,!p/total)/((v/total)*(!p/total))=(300/1000)/((400/1000)*(700/1000))=1.07
all_conf(X)=sup(X)/max_item_sup(X)
all_conf(v)=v,p/max(v)=100/max(400,300)=100/400=0.25
小於0.5,表示不獨立
coh(x)=sup(X)/|universe(X)|
coh(v)=v,p/(v,p+!v,p+v,!p)=100/100+300+200=100/600=0.20
小於0.5,表示不獨立
cosine =(100/1000) /(sqrt(400/1000)+sqrt(300/1000))=0.0847
ps:
在min support方面
virus,p2p = 100/1000=10%
virus = 400/1000 =40%
p2p = 300/1000=30%
ps:
在min confidence方面
virus=>p2p( 100/400=25% )
p2p=>virus(100/300 =33%)