Mahout Logistic

Mahout logistic regression
Mahout's logistic regression is implemented with SGD (stochastic gradient descent).

Training on a data set
# mahout trainlogistic
Commonly used options:
--input <file-or-resource> Uses the specified file or resource as input.
--output <file-for-model> Puts the model into the specified file.
--target <variable> Uses the specified variable as the target.
--categories <n> Specifies how many categories the target variable has.
--predictors <v1> ... <vn> Specifies the names of the predictor variables.
--features Sets the size of the internal feature vector to use in building the model. A larger value here can be helpful, especially with text-like input data.
--rate Sets the initial learning rate. This can be large if you have lots of data or use lots of passes because it's decreased progressively as data is examined.
--passes Specifies the number of times the input data should be reexamined during training. Small input files may need to be examined dozens of times. Very large input files probably don't even need to be completely examined.
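
To make --rate and --passes more concrete, here is a minimal Python sketch of logistic regression trained by SGD. It is not Mahout's implementation: the function names, the rate / sqrt(1 + t) decay schedule, and the toy data are all assumptions made for illustration. Only the general idea comes from the option descriptions above: the learning rate shrinks as more examples are seen, and a small input is examined several times.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written so that large |z| cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(rows, labels, rate=0.5, passes=20, seed=0):
    """Train binary logistic regression with plain SGD.

    rows   : list of feature lists
    labels : list of 0/1 targets
    rate   : initial learning rate (plays the role of --rate)
    passes : number of sweeps over the data (plays the role of --passes)

    The decay rate / sqrt(1 + t) is an assumed schedule, not Mahout's.
    Returns the weights, with weights[0] being the intercept.
    """
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    step = 0
    for _ in range(passes):
        rng.shuffle(order)  # visit the examples in a new order on each pass
        for i in order:
            x = [1.0] + list(rows[i])  # constant 1 for the intercept
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            lr = rate / math.sqrt(1.0 + step)  # learning rate decays over time
            error = labels[i] - p
            for j, v in enumerate(x):
                weights[j] += lr * error * v
            step += 1
    return weights


def predict(weights, row):
    """Probability that the row belongs to class 1."""
    x = [1.0] + list(row)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


if __name__ == "__main__":
    # Hypothetical toy data: one feature, negative values are class 0.
    toy_rows = [[-2.0], [-1.0], [1.0], [2.0]]
    toy_labels = [0, 0, 1, 1]
    w = train_sgd(toy_rows, toy_labels, rate=0.5, passes=20)
    for row in toy_rows:
        print(row, round(predict(w, row), 3))
```

With so few examples a single pass barely moves the weights, which is why a small file benefits from many passes.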

Testing on a data set
# mahout runlogistic
Commonly used options:
--auc Prints AUC score for model versus input data after reading data.
--scores Prints target variable value and scores for each input example.
--confusion Prints confusion matrix for a particular threshold (see --threshold).
--input <input> Reads data records from specified file or resource.
--model <model> Reads model from specified file.
The output looks roughly like this:
AUC = 0.57
confusion: [[27.0, 13.0], [0.0, 0.0]]
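Explanation
AUC (area under the ROC curve) ranges from 0 to 1. It is the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative one: 1 means a perfect ranking, and 0.5 is no better than random guessing.
The confusion matrix is laid out as [[TP, FP], [FN, TN]].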
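
As a rough illustration of what these two numbers measure, the Python sketch below computes AUC from the pairwise-ranking definition and builds a confusion matrix in the [[TP, FP], [FN, TN]] layout given above. This is not Mahout's code: the function names, the choice of label 1 as the positive class, and the 0.5 threshold are assumptions for the example.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion(labels, scores, threshold=0.5):
    """Counts as [[TP, FP], [FN, TN]], treating label 1 as positive (assumed)."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return [[tp, fp], [fn, tn]]


if __name__ == "__main__":
    # Hypothetical labels and scores, not taken from a Mahout run.
    y = [1, 1, 0, 0]
    s = [0.9, 0.4, 0.6, 0.1]
    print("AUC =", auc(y, s))
    print("confusion:", confusion(y, s))
```

AUC depends only on how the scores are ordered, while the confusion matrix changes with the threshold, so the two can disagree about how good a model looks.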

……………………………………….


Demo

Walkthrough
The data set file is named testclass.
Its contents are as follows: three attributes and one tag.
# vi testclass
a,b,c,tag
1,1,1,1
2,2,2,1
3,3,3,1
4,4,4,1
5,5,5,1
6,6,6,2
7,7,7,2
8,8,8,2
9,9,9,2
10,10,10,2

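Run the following command to build a model (this is the training phase of the classification process):
# mahout trainlogistic --input testclass --output model --target tag --categories 2 --predictors a b c --types numeric --features 3
It produces the following output: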
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7.0.1.3.3.0-58-job.jar
3
tag ~ -91.295*Intercept Term + 6.690*a + 6.690*b + 3.321*c
Intercept Term -91.29475
a 6.68964
b 6.68964
c 3.32069
6.689641057 -91.294753096 3.320686879
14/02/20 22:18:45 INFO driver.MahoutDriver: Program took 399 ms (Minutes: 0.00665)
P.S. The Intercept Term input defaults to 1, so its weight acts as a constant offset.

The model is:
-91.295 + 6.690*a + 6.690*b + 3.321*c

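This means that plugging the values of attributes a, b, and c into the formula -91.295 + 6.690*a + 6.690*b + 3.321*c
gives the score for each row, as follows:
a,b,c,tag => result of the formula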
1,1,1,1 => -74.594
2,2,2,1 => -57.893
3,3,3,1 => -41.192
4,4,4,1 => -24.491
5,5,5,1 => -7.79
6,6,6,2 => 8.911
7,7,7,2 => 25.612
8,8,8,2 => 42.313
9,9,9,2 => 59.014
10,10,10,2 => 75.715
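Decision boundary: the midpoint of the two extreme scores is (-74.594 + 75.715) / 2 = 0.5605. Logistic regression turns a score z into a probability 1 / (1 + e^(-z)), so a probability of 0.5 corresponds to z = 0, which separates this data equally well.
Scores below the boundary are classified as tag 1, scores above it as tag 2.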
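
The table above can be reproduced with a few lines of Python. This is only a sketch: the coefficients are the rounded values from the model formula, the helper names are made up for the example, and it assumes the usual logistic-regression rule that a score of 0 (probability 0.5) separates the classes, with positive scores mapped to tag 2.

```python
import math

# Rounded coefficients copied from the model formula in this post.
INTERCEPT, W_A, W_B, W_C = -91.295, 6.690, 6.690, 3.321


def score(a, b, c):
    """Linear score: intercept plus the weighted attribute values."""
    return INTERCEPT + W_A * a + W_B * b + W_C * c


def probability(z):
    """Logistic function, written so that large |z| cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(a, b, c):
    """Assumed rule: a positive score means tag 2, otherwise tag 1."""
    return 2 if score(a, b, c) > 0 else 1


if __name__ == "__main__":
    # The ten rows of testclass all have a = b = c.
    for i in range(1, 11):
        z = score(i, i, i)
        print(f"{i},{i},{i} => {z:.3f}  p={probability(z):.4f}  tag={classify(i, i, i)}")
```

Because the scores jump from -7.79 to 8.911 between rows 5 and 6, any boundary in that gap classifies all ten rows correctly.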


Test the trained model (if everything went well, the AUC is 1):
# mahout runlogistic --input testclass --model model --auc --confusion
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7.0.1.3.3.0-58-job.jar
AUC = 1.00
confusion: [[5.0, 0.0], [0.0, 5.0]]
entropy: [[-0.0, NaN], [-33.1, -0.0]]
14/03/15 00:42:05 INFO driver.MahoutDriver: Program took 139 ms (Minutes: 0.0023166666666666665)