決策樹
應用:
常用在分類,也就是先從資料中找出規則,在利用該規則從其他資料中找出結果
常見方法:
1.C3 and C4.5
2.gini index:只可處理屬性僅2種類別的資料,超過要進行類別的合併
3.其他方法有CHAID,C-SEP,G-statistics,MDL
概念
由上而下遞迴的分裂和克服的方法
停止條件
1所有結果都屬於同一類時,ex:全部的結果都是yes,而沒有no時,或相反
2己沒有屬性可以做分裂時,其餘結果以多數結果代表這部份的全體
3己沒有資料,以多數結果代表這部份的全體
限制
若產生的分支太多則之後進行分類預測時效果會變差
可使用以下兩方法解決
prepruning:樹在成長時,若超過門檻值則立即剪掉
postpruning:樹成長完後,在把不必要的分支剪掉
continuous value處理方式
主要有兩種做法
1,自行定義continuous value各範圍屬於一個值
ex:1-20year定義為young,60-90year定義為older
2,計算各結果的gain,等同各結果都掃描一次,很花時間
ex:value有23,25,28,31
則先計算小於及大於(23+25)/2的增益,一直算到小於及大於(28+31)/2的增益,
看那一種增益結果較好,則選那個組合
C3的計算方式
1
info(D)= -segma( i , Pi*log2(Pi) )
i=class屬性中的不同值,ex:{yes,no},{high,medium,low}
Pi=class屬性中各值在該class發生的機率
2
info_a(D)=segma( j , Dj/D*info(Dj) )
gain(a)=info(D)-info_a(D)
3
new node = max(gain_all)
根據該屬性的值產生分支
4
各分支回到步驟1進行計算
ps
在c4.5時,以gainratio代替gain,這可提昇attribute selection的結果
gainratio(a)=gain(a)/splitinfo(a)
splitinfo(a)=segma_j( j , Dj/D*log2(Dj/D) )
gini index的計算方式
1
gini(D)=1-sigma( i ,Pi^2)
2
gini_a(D)=D1/D*gini(D1)+D2/D*gini(D2)
ps:因為gini index只接受兩個類別,若出現D3時,需和D1或D2合併
gini(a)=gini(D)-gini_a(D)
3
new node = min(gini_all)
根據該屬性的值產生分支
4
各分支回到步驟1進行計算
################################################################
範例
ex:
rid | flow | packet | student | over1gb | class:is infected? |
1 | low | high | no | no | no |
2 | low | high | no | yes | no |
3 | medium | high | no | no | yes |
4 | high | medium | no | no | yes |
5 | high | low | yes | no | yes |
6 | high | low | yes | yes | no |
7 | medium | low | yes | yes | yes |
8 | low | medium | no | no | no |
9 | low | low | yes | no | yes |
10 | high | medium | yes | no | yes |
11 | low | medium | yes | yes | yes |
12 | medium | medium | no | yes | yes |
13 | medium | high | yes | no | yes |
14 | high | medium | no | yes | no |
使用C3
1
P1=yes機率=9/14
P2=no機率=5/14
info(D)= ( -P1*log2(P1)) + ( – P2*log2(P2))
= ( -9/14*log2(9/14)) + ( -5/14*log2(5/14) )
=0.94
2
info_flow(D)=D1/D*( info(D1)) + D2/D*( info(D2)) + D3/D*( info(D3))
=D1/D*( -P1*log2(P1) – P2*log2(P2) )
+ D2/D*( -P1*log2(P1) – P2*log2(P2) )
+ D3/D*( -P1*log2(P1) – P2*log2(P2) )
=5/14*( -2/5*log2(2/5) -3/5*log2(3/5) )
+ 4/14*( -4/4*log2(4/4) -0/4*log2(0/4) )
+ 5/14*( -3/5*log2(3/5) -2/5*log2(2/5) )
=0.694
gain(flow)=0.246=0.940-0.694=info(D)-info_flow(D)
similarly
gain(packet)=0.029
gain(student)=0.151
gain(over1gb)=0.048
3
max(gain_all)=gain(flow)
attribute flow become root node
共根據flow的值分成3個分支,分別是flow_low,flow_medium,flow_high
ps:
若使用C4.5
splitpacket(packet)=0.926= -4/14*log2(4/14) -6/14*log2(6/14) -4/14*log2(4/14)
gainratio(packet)=0.029/0.926=0.031
splitpacket(flow)=1.577= -5/14*log2(5/14) -4/14*log2(4/14) -5/14*log2(5/14)
gainratio(flow)=0.246/1.577=0.156
similarly
gainratio(student)=
gainratio(over1gb)=
該範例結果相同
max(gain_all)=gain(flow)
attribute flow become root node
ps:
若使用gini index
1
gini(D)=1-(9/14)^2-(5/14)^2=0.459
2
gini_student(D)=0.367
=7/14*gini(D1) + 7/14*gini(D2)
=7/14*(1-(6/7)^2 -(1/7)^2 ) + 7/14*(1-(3/7)^2-(4/7)^2)
gini_packet(D)=0.45
= gini_packet{low,medium} + gini_packet{high} //有三種值時需合併部份的值,變成2種值
=10/14*gini(D1) + 4/14*gini(D2)
=10/14*( 1-(6/10)^2-(4/10)^2 ) + 4/14*( 1-(1/4)^2-(3/4)^2 )
similarly
gini_over1gb(D)=0.429
gini_flow(D)=0.375
3
min(gini_all)=gini(packet)=gini(D)-gini_packet(D)=0.459-0.45
attribute packet become root node
並根據packet的值分成3個分支,分別是packet_low,packet_medium,packet_high
此結果與C3和C4.5不同
………………………..
分支flow_low,該分支的class不同,因此在進行同樣的計算
flow | packet | student | over1gb | class: is infected? |
low | high | no | no | no |
low | high | no | yes | no |
low | medium | no | no | no |
low | low | yes | no | yes |
low | medium | yes | yes | yes |
1
info(D1)=同上=5/14*( -2/5*log2(2/5) -3/5*log2(3/5) ) =0.972
2
info_packet(D1)=D1/D*( info(D1 )) + D2/D*( info(D2 )) + D3/D*( info(D3 ))
=D1/D*( P1*log2(P1) – P2*log2(P2) )
+ D2/D*( P1*log2(P1) – P2*log2(P2) )
+ D3/D*( P1*log2(P1) – P2*log2(P2) )
=1/5*( -1/1*log2(1/1) -0/1*log2(0/1) )
+ 2/5*( -1/2*log2(1/2) -1/2*log2(1/2) )
+ 2/5*( -0/2*log2(0/2) -2/2*log2(2/2) )
=0+2/5+0
=0.4
gain_D1(packet)=0.972-0.4=0.572
similarly
gain_D1(student)=0.972
gain_D1(over1gb)=
3
max(gain_all)=gain_D1(student)=0.972
attribute student become node
共根據student的值分成2個分支,分別是student_yes,student_no
分支flow_low的分支student_yes,該分支的class全相同,因此該分支到此結束
flow | packet | student | over1gb | class:is infected? |
low | low | yes | no | yes |
low | medium | yes | yes | yes |
分支flow_low的分支student_no,該分支的class全相同,因此該分支到此結束
flow | packet | student | over1gb | class:is infected |
low | high | no | no | no |
low | high | no | yes | no |
low | medium | no | no | no |
……………………………..
分支flow_medium,該分支的class全相同,因此該分支到此結束
flow | packet | student | over1gb | class:is infected |
medium | high | no | no | yes |
medium | low | yes | yes | yes |
medium | medium | no | yes | yes |
medium | high | yes | no | yes |
……………………………….
分支flow_high,該分支的class不同,因此在進行同樣的計算
flow | packet | student | over1gb | class:is infected |
high | medium | no | no | yes |
high | low | yes | no | yes |
high | low | yes | yes | no |
high | medium | yes | no | yes |
high | medium | no | yes | no |
1
info(D3)=同上=5/14*( -2/5*log2(2/5) -3/5*log2(3/5) ) =0.972
2
gain_D3(packet)=
gain_D3(student)=
gain_D3(over1gb)=0.972
3
max(gain_all)=gain_D3(over1gb)=0.972
attribute over1gb become node
並根據over1gb的值分成2個分支,分別是over1gb_yes,over1gb_no
分支flow_high的分支over1gb_no,該分支的class全相同,因此該分支到此結束
flow | packet | student | over1gb | class:is infected |
high | medium | no | no | yes |
high | low | yes | no | yes |
high | medium | yes | no | yes |
分支flow_high的分支over1gb_yes,該分支的class全相同,因此該分支到此結束
flow | packet | student | over1gb | class:is infected |
high | low | yes | yes | no |
high | medium | no | yes | no |
……………………………
結果:
infected signature如下
1.flow:low , student:yes
2.flow:medium
3.flow:high , over1gb:yes