Mahout Clustering

Canopy clustering algorithm
#mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
Common options:
-i: Input vectors' directory
-o: Output working directory
-dm: Distance measure
 Common -dm choices:
 org.apache.mahout.common.distance.CosineDistanceMeasure
 org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure (default)
-t1: T1 threshold
-t2: T2 threshold
-ow: Overwrite output directory if present

Canopy steps
1. Pick two distance thresholds, T1 (outer circle) and T2 (inner circle), with T1 > T2.
2. Vectorize the dataset and put the vectors into a list.
3. Take any point P from the list:
 - If no canopy exists yet, P becomes the center of a new canopy.
 - If P is within T1 (outer circle) of a canopy, P joins that canopy.
 - If P is within T2 (inner circle) of a canopy, P joins it and is removed from the list, so it can never seed a new canopy.
 - If P is not within T1 of any canopy, P becomes the center of a new canopy.
4. Repeat step 3 until the list is empty. (A single-machine sketch of this loop follows below.)
Notes:
t1 (outer circle): mainly affects how many points each cluster contains.
t2 (inner circle): mainly affects the number of clusters; too large collapses everything into one cluster, too small produces too many clusters.
Canopy is best suited for estimating the number of clusters (for example, to pick k for K-means) rather than for producing the final clustering.
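A minimal single-machine sketch of that loop, assuming Euclidean distance (Mahout actually runs canopy as a MapReduce job over vector sequence files; the function and variable names here are illustrative only, not Mahout's API):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, dist=euclidean):
    """One greedy canopy pass; requires t1 > t2."""
    candidates = list(points)
    canopies = []                       # each entry: (center, members)
    while candidates:
        center = candidates.pop(0)      # any remaining point seeds a new canopy
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:                  # inside the outer circle: join this canopy
                members.append(p)
            if d >= t2:                 # outside the inner circle: keep the point,
                remaining.append(p)     # it may later seed a canopy of its own
        candidates = remaining
        canopies.append((center, members))
    return canopies

Because a point inside T1 but outside T2 of a center both joins that canopy and stays available as a future center, canopies can overlap, and the exact result depends on the order in which points are taken.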

…….

K-means clustering algorithm
#mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Common options:
-i <hadoop input path>: Path to the job input directory.
-o: The directory pathname for output.
-dm <distanceMeasure>: The classname of the DistanceMeasure. Default is SquaredEuclidean.
-k <number of clusters>: The number of clusters to create.
-t1 <threshold value>: T1 threshold value.
-t2 <threshold value>: T2 threshold value.
-x <number of iterations>: The maximum number of iterations.
-ow: If present, overwrite the output directory before running the job.
-h: Print out help.
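Per iteration the job computes ordinary Lloyd-style k-means, distributed over MapReduce. A single-machine sketch, assuming random seeding as used when -k is given (names are illustrative, not Mahout's API):

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter, dist=euclidean, seed=None):
    centers = random.Random(seed).sample(points, k)        # cf. output/random-seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                              # -x: iteration cap
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)                          # nearest center wins
        new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]    # update step: per-cluster mean
        if new_centers == centers:                         # converged before the cap
            break
        centers = new_centers
    return centers, clusters

# e.g. kmeans(data, k=3, max_iter=10) mirrors the -k 3 -x 10 demo run below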

….

#mahout clusterdump
Common options:
-i <hadoop input path>: Path to the job input directory.
-o: The directory pathname for output.
-p <points dir>: The directory containing the points sequence files that map input vectors to their clusters. If specified, the program also outputs the points associated with each cluster.

Sample output
VL-8{n=3 c=[5.667, 5.667] r=[0.471, 0.471]}
Weight : [props - optional]: Point:
1.0: [5.000, 6.000]
1.0: [6.000, 5.000]
1.0: [6.000, 6.000]
Field descriptions:
VL-8: cluster name (the VL prefix marks a converged cluster; 8 is the cluster id)
n: number of points in the cluster
c: center of the cluster
r: radius of the cluster
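To make c and r concrete, a quick numeric check against the VL-8 sample above: c is the component-wise mean of the member points, and r matches their component-wise (population) standard deviation:

import math

points = [(5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]   # the three members of VL-8
n = len(points)
c = [sum(xs) / n for xs in zip(*points)]
r = [math.sqrt(sum((x - m) ** 2 for x in xs) / n)
     for xs, m in zip(zip(*points), c)]
print(c)   # [5.666..., 5.666...]  -> printed as c=[5.667, 5.667]
print(r)   # [0.471..., 0.471...]  -> printed as r=[0.471, 0.471]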


……………………………………………………………………………………….

demo

test dataset
#vi clustering.data
0 1
1 0
1 1
2 1
1 2
2 2
5 6
6 5
6 6
9 9
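These ten points form three well-separated groups: six around (1,1), three around (6,6), and the outlier (9,9). A quick check (an illustrative script, not part of the demo) confirms that every between-group gap exceeds the t1=3 used below, so no single canopy can span two groups:

import math

groups = {
    "low":  [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2)],
    "mid":  [(5, 6), (6, 5), (6, 6)],
    "high": [(9, 9)],
}

def d(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

names = list(groups)
for i, g1 in enumerate(names):
    for g2 in names[i + 1:]:
        gap = min(d(a, b) for a in groups[g1] for b in groups[g2])
        print(g1, g2, round(gap, 2))
# low mid 5.0 / low high 9.9 / mid high 4.24 -- all gaps > t1=3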

# hadoop fs -mkdir testdata
# hadoop fs -put clustering.data testdata
# hadoop fs -ls -R testdata

-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/clustering.data

Canopy clustering
#mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output
…omit…
14/09/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)

#hadoop fs -ls -R output
drwxr-xr-x - root hdfs 0 2014-09-08 01:31 output/clusteredPoints
-rw-r--r-- 3 root hdfs 0 2014-09-08 01:31 output/clusteredPoints/_SUCCESS
-rw-r--r-- 3 root hdfs 512 2014-09-08 01:31 output/clusteredPoints/part-m-00000
drwxr-xr-x - root hdfs 0 2014-09-08 01:30 output/clusters-0-final
-rw-r--r-- 3 root hdfs 0 2014-09-08 01:30 output/clusters-0-final/_SUCCESS
-rw-r--r-- 3 root hdfs 202 2014-09-08 01:30 output/clusters-0-final/_policy
-rw-r--r-- 3 root hdfs 649 2014-09-08 01:30 output/clusters-0-final/part-r-00000
drwxr-xr-x - root hdfs 0 2014-09-08 01:29 output/data
-rw-r--r-- 3 root hdfs 0 2014-09-08 01:29 output/data/_SUCCESS
-rw-r--r-- 3 root hdfs 382 2014-09-08 01:29 output/data/part-m-00000

#mahout clusterdump --input output/clusters-0-final --pointsDir output/clusteredPoints
C-0{n=1 c=[9.000, 9.000] r=[]}
Weight : [props - optional]: Point:
1.0: [9.000, 9.000]
C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]}
Weight : [props - optional]: Point:
1.0: [5.000, 6.000]
1.0: [6.000, 5.000]
1.0: [6.000, 6.000]
C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]}
Weight : [props - optional]: Point:
1.0: [1:1.000]
1.0: [0:1.000]
1.0: [1.000, 1.000]
1.0: [2.000, 1.000]
1.0: [1.000, 2.000]
1.0: [2.000, 2.000]
(The first two lines are sparse vector displays: [1:1.000] is the point (0,1) and [0:1.000] is (1,0), printed as nonzero index:value pairs.)


…………

Run K-means, specifying 3 clusters
#mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -k 3 -t1 3 -t2 2 -x 10 -i testdata -o output
…omit…
14/09/08 00:47:53 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 00:47:53 INFO driver.MahoutDriver: Program took 194271 ms (Minutes: 3.23785)

#hadoop fs -ls -R output
-rw-r--r-- 3 root hdfs 194 2014-09-08 00:47 output/_policy
drwxr-xr-x - root hdfs 0 2014-09-08 00:47 output/clusteredPoints
-rw-r--r-- 3 root hdfs 0 2014-09-08 00:47 output/clusteredPoints/_SUCCESS
-rw-r--r-- 3 root hdfs 512 2014-09-08 00:47 output/clusteredPoints/part-m-00000
drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output/clusters-0
-rw-r--r-- 3 root hdfs 194 2014-09-08 00:45 output/clusters-0/_policy
-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 output/clusters-0/part-00000
-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 output/clusters-0/part-00001
-rw-r--r-- 3 root hdfs 332 2014-09-08 00:45 output/clusters-0/part-00002
drwxr-xr-x - root hdfs 0 2014-09-08 00:46 output/clusters-1
-rw-r--r-- 3 root hdfs 0 2014-09-08 00:46 output/clusters-1/_SUCCESS
-rw-r--r-- 3 root hdfs 194 2014-09-08 00:46 output/clusters-1/_policy
-rw-r--r-- 3 root hdfs 689 2014-09-08 00:46 output/clusters-1/part-r-00000
drwxr-xr-x - root hdfs 0 2014-09-08 00:46 output/clusters-2
-rw-r--r-- 3 root hdfs 0 2014-09-08 00:46 output/clusters-2/_SUCCESS
-rw-r--r-- 3 root hdfs 194 2014-09-08 00:46 output/clusters-2/_policy
-rw-r--r-- 3 root hdfs 698 2014-09-08 00:46 output/clusters-2/part-r-00000
drwxr-xr-x - root hdfs 0 2014-09-08 00:47 output/clusters-3-final
-rw-r--r-- 3 root hdfs 0 2014-09-08 00:47 output/clusters-3-final/_SUCCESS
-rw-r--r-- 3 root hdfs 194 2014-09-08 00:47 output/clusters-3-final/_policy
-rw-r--r-- 3 root hdfs 698 2014-09-08 00:47 output/clusters-3-final/part-r-00000
drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output/data
-rw-r--r-- 3 root hdfs 0 2014-09-08 00:45 output/data/_SUCCESS
-rw-r--r-- 3 root hdfs 382 2014-09-08 00:45 output/data/part-m-00000
drwxr-xr-x - root hdfs 0 2014-09-08 00:45 output/random-seeds
-rw-r--r-- 3 root hdfs 551 2014-09-08 00:45 output/random-seeds/part-randomSeed
(Each clusters-N directory is a snapshot of the cluster centers: clusters-0 holds the initial seeds drawn from random-seeds, and clusters-3-final holds the final centers, reached here after 3 iterations, well under the -x 10 cap.)

#mahout clusterdump --input output/clusters-3-final --pointsDir output/clusteredPoints
VL-8{n=3 c=[5.667, 5.667] r=[0.471, 0.471]}
Weight : [props - optional]: Point:
1.0: [5.000, 6.000]
1.0: [6.000, 5.000]
1.0: [6.000, 6.000]
VL-9{n=1 c=[9.000, 9.000] r=[]}
Weight : [props - optional]: Point:
1.0: [9.000, 9.000]
VL-6{n=6 c=[1.167, 1.167] r=[0.687, 0.687]}
Weight : [props - optional]: Point:
1.0: [1:1.000]
1.0: [0:1.000]
1.0: [1.000, 1.000]
1.0: [2.000, 1.000]
1.0: [1.000, 2.000]
1.0: [2.000, 2.000]