Frequent Pattern Analysis

frequent pattern analysis
frequent pattern/itemset:在data set中常出現的頻率
ex:a set of item,subsequences,substructures
目地:找出資料固定的規則
應用:basket data analysis,cross-marketing(市場交互情形),catalog design(產品搭配的設計),…等

基本概念
找出x,y的minimum support和confidence
support:x,y同時發生的比率
confidence x=>y: x發生時,y發生的比例

主要步驟
1.find all frequent itemset:
frequently需大於minimum support
2.generate strong association rule from the frequent itemset
需滿足minimum support和minimum confidence
(option)3.correlation analysis for association rule

………………………………………………………………………………………………………………..

1.find all frequent itemset

常見的scalable mining method如下
apriori:簡單的方法,但耗時,因為要比較candidate,而且要一直scan
FPgrowth:計算快速,因為不用產生candidate,而且scan僅需一次
vertical data format approach:將欄與列轉換,方法簡單,比apriori快

…..

apriori
方法致如下
1.scan:計算每個itemset support,得到Cn
2.compare:各itemset(candidate) support和mini support比較,以得到Ln
3.根據Ln產生Cn+1
4.重覆1-3直到結束

apriori做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)
產生l3如下

	C1:scan itemset : sup 1 : 6 2 : 7 3 : 6 4 : 2 5 : 2	L1:大於min sup之candidate itemset : sup 1 : 6 2 : 7 3 : 6 4 : 2 5 : 2
generate C2 itemset 1,2 1,3 1,4 1,5 2,3 2,4 2,5 3,4 3,5 4,5	C2:scan itemset : sup 1,2 : 4 1,3 : 4 1,4 : 1(小於min support) 1,5 : 2 2,3 : 4 2,4 : 2 2,5 : 2 3,4 : 0(小於min support) 3,5 : 1(小於min support) 4,5 : 0(小於min support)	L2:大於min sup之candidate itemset : sup 1,2 : 4 1,3 : 4 1,5 : 2 2,3 : 4 2,4 : 2 2,5 : 2
generate C3 itemset 1,2,3 1,2,5 1,3,5(因3,5:1所以不用) 2,3,5(因3,5:1所以不用) 2,3,4(因3,4:0所以不用) 2,4,5(因4,5:0所以不用)	C3:scan itemset : sup 1,2,3 : 2 1,2,5 : 2	L3:大於min sup之candidate itemset : sup 1,2,3 : 2 1,2,5 : 2
generage C4 itemset 1,2,3,5	C4:scan itemset : sup 1,2,3,5: 1(小於min support)	L4:大於min sup之candidate itemset : sup

so
L3 {1,2,5*2} {1,2,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{1,2*4}

…..

FPgrowth
方法大致如下
1,compute support and sort by support
2,refresh by step1 result
3,create tree by step2 result
4,conditional pattern base => frequent patterns generated

FPgrowth做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)

1
compute support(若有小於support則將他排除) and sort by support
itemset : sup
2 : 7
1 : 6
3 : 6
4 : 2
5 : 2

2
refresh by step1 result
a: 2,1,5
b: 2,4
c: 2,3
d: 2,1,4
e: 1,3
f: 2,3
g: 1,3
h: 2,1,3,5
i: 2,1,3

4
conditional pattern base => frequent patterns generated
5 ((2,1*1),(2,1,3*1)) => {2,5*2},{1,5*2},{2,1,5*2} //{3,5*1} 小於mini sup
4 ((2,1*1),(2*1)) => {2,4*2} //{2,1*1} 小於mini sup
3 ((2,1*2),(2*2),(1*2)) => {2,3*4},{1,3*4},{2,1,3*2}
1 (2*4) => {2,1*4}

so
L3 {2,1,5*2} {2,1,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{2,1*4}

…..

vertical data format approach
方法大致如下
1欄列轉換
2將前一步驟未小於min support的值做組合並交集

vertical data format approach做法範例
id: items
a: 1,2,5
b: 2,4
c: 2,3
d: 1,2,4
e: 1,3
f: 2,3
g: 1,3
h: 1,2,3,5
i: 1,2,3
itemset mini support=2,or 22%(=2/9)

1
欄列轉換
itemset : sup
1:a,d,e,g,h,i
2:a,b,c,d,f,h,i
3:c,e,f,g,h,i
4:b,d
5:a,h

2
將前一步驟未小於min support的值做組合並交集
12:a,h,i
13:e,g,h,i
14:d　　(小於min support)
15:a,h
23:c,f,h,i
24:b,d
25:a,h
34:　　(小於min support)
35:h (小於min support)
45:　　(小於min support)

3
將前一步驟未小於min support的值做組合並交集
1,2,3:h,i
1,2,5:a,h
1,3,5:h (因3,5:0所以不用)
2,3,4: (因3,4:0所以不用)
2,3,5: (因3,5:0所以不用)
2,4,5: (因4,5:0所以不用)

so
L3 {2,1,5*2} {2,1,3*2}
L2 {2,5*2},{1,5*2},{2,4*2} ,{2,3*4},{1,3*4},{2,1*4}

……………………………………………………………………………………………………………

2.generate strong association rule from the frequent itemset

x=>y(confidence): confidence=x發生時,y發生的比例

做法範例(if confidence threshold is 70%,support as above)
{2,5*2} 2=>5(2/7) ; 5=>2(2/2=100%>70%)　
{1,5*2} 1=>5(2/6) ; 5=>1(2/2 =100%>70%)
{2,4*2} 2=>4(2/7) ; 4=>2(2/2=100%>70%)
{2,3*4} 2=>3(4/7) ; 3=>2(4/6)
{1,3*4} 1=>3(4/6) ; 3=>1(4/6)
{2,1*4} 1=>2(4/6) ; 2=>1(4/7)
{2,1,5*2}
　1^2=>5(2/4=50%)
　1^5=>2(2/2=100%>70%)
　2^5=>1(2/2=100%>70%)
　1=>2^5(2/6=33%)
　2=>1^5(2/7=29%)
　5=>1^2(2/2=100%>70%)
{2,1,3*2}
　1^2=>3 ,
　1^3=>2
　2^3=>1 ,
　1=>2^3 ,
　2=>1^3 ,
　3=>1^2 ,

結果可寫成以下
x=>y(support,confidence)
5=>2(40%,100%)　
5=>1(40%,100%)
4=>2(40%,100%)

……………………………………..

找出各種association rules
常見有以下
mining multilevel association,
mining multidimensional association
mining quantitave association
mining interesting correlation patterns

mining multilevel association,
各層support計算方式主要分為兩種
　uniform support:每一層都用相同的support
　reduced support:越下層用越少的support
ex:第一層virus[sup=10%],第二層可分為file virus[sup=4%]和boot virus[sup=6%]
uniform support=5%,則第二層file virus[sup=4%]則不被找到
reduced support,第一層=5%,第二層=3%,即可全部找到

mining multidimensional association
主要分兩種rule
inter-dimension association rules(no repeated) ex:age(20-30)^job(student)=>buy(coke)
hybrid-dimension association rule(repeated) ex:age(20-30)^buy(pizza)=>buy(coke)

mining quantitave association
1將數值先切割成各段
2找出association
ex:age(20)^income(10-20k)=>buy(pc)

…………………………………………………………………………………………………………….

(option)3.correlation analysis for association rule

只靠support及confidence有時不夠,因此有時會加correlation
ex:
假設在10000筆購買記錄中,買computer有6000人,買lcd有7500人,兩者同時買有4000人
且min support=30%,min confidence=60%
假如發現以下association rule
buy(computer)=>buy(lcd)[support=40%,confidence=66%]
買computer的人,會有66%買lcd
但是因為根據購買記錄可以看到,買lcd機率為75％,所以該association rule沒有幫助

常見的correlation measures有
chi-squre
lift
all_conf,較適合用於association rule
coh,較適合用於association rule

假設virus和p2p的contingency table如下

	virus	no virus	sum
p2p	v,p (100)	!v,p (200 )	p (300)
no p2p	v,!p (300)	!v,!p (400)	!p (700)
sum	v (400)	!v (600)	total (1000)

lift(A,B)=p(A交集B)/p(A)*p(B)
lift(v,p)=(v,p/total)/((v/total)*(p/total))=(100/1000)/((400/1000)*(300/1000))=0.83
lift(v,!p)=(v,!p/total)/((v/total)*(!p/total))=(300/1000)/((400/1000)*(700/1000))=1.07

all_conf(X)=sup(X)/max_item_sup(X)
all_conf(v)=v,p/max(v)=100/max(400,300)=100/400=0.25
小於0.5,表示不獨立

coh(x)=sup(X)/|universe(X)|
coh(v)=v,p/(v,p+!v,p+v,!p)=100/100+300+200=100/600=0.20
小於0.5,表示不獨立

cosine =(100/1000) /(sqrt(400/1000)+sqrt(300/1000))=0.0847

ps:
在min support方面
virus,p2p = 100/1000=10%
virus = 400/1000 =40%
p2p = 300/1000=30%
ps:
在min confidence方面
virus=>p2p( 100/400=25% )
p2p=>virus(100/300 =33%)

1.find all frequent itemset

2.generate strong association rule from the frequent itemset

(option)3.correlation analysis for association rule

分享此文：