SKLearn Feature

The two most common approaches to feature reduction:
Feature compression (dimensionality reduction): use an algorithm to compress many features into a small set of new features
Feature selection: use an algorithm to pick out the most representative of the original features

………………………………………………………………………………..

Decomposition (feature compression)

There are many ways to reduce dimensionality, such as PCA, TruncatedSVD, and LDA. Below we introduce PCA.
refer
http://scikit-learn.org/stable/modules/decomposition.html

PCA
Only reduces dimensionality; it does not pick out key original dimensions
ps:
PCA does not keep some original dimensions and discard others. It transforms the data into a new set of dimensions whose values are completely different from the original ones.
refer
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
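To make this concrete, here is a minimal sketch (on the bundled iris data) showing that each new dimension produced by PCA is a weighted combination of all the original features, via the components_ attribute:

from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=2).fit(iris.data)

# each row of components_ is one new dimension, expressed as weights
# over the 4 original iris features; no original column is kept as-is
print(pca.components_.shape)   # (2, 4)
print(pca.components_)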


Load the module
from sklearn.decomposition import PCA
Specify the number of target dimensions
pca = PCA(n_components=2)
Fit and transform the data to reduce
data_pca = pca.fit_transform(data)


example
Reduce the 4-dimensional iris data to 2 dimensions
>>> from sklearn import datasets
>>> from sklearn.decomposition import PCA
>>> iris = datasets.load_iris()
>>> pca = PCA(n_components=2)
>>> data_pca = pca.fit_transform(iris.data)
>>> print(data_pca.shape)
(150, 2)
>>> print(pca.explained_variance_ratio_)
[0.92461621 0.05301557]
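explained_variance_ratio_ can also drive the choice of n_components: in recent scikit-learn versions, passing a float between 0 and 1 (with svd_solver='full') keeps just enough components to cover that fraction of the variance. A minimal sketch:

from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
# keep as many components as needed to explain at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
data_pca = pca.fit_transform(iris.data)
print(data_pca.shape)                        # how many components were kept
print(pca.explained_variance_ratio_.sum())   # >= 0.95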


example
Project the iris dataset to 2 dimensions with PCA and plot the result in a single figure

#!/usr/bin/env python

import matplotlib.pyplot as pl   # pyplot replaces the deprecated pylab
from itertools import cycle
from sklearn.decomposition import PCA
from sklearn import datasets
import numpy as np

def plot_2D(data, target, target_names):
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    pl.figure()
    # draw one scatter per class, cycling through the color list
    for i, c, label in zip(target_ids, colors, target_names):
        pl.scatter(data[target == i, 0], data[target == i, 1], c=c, label=label)
    pl.legend()
    pl.savefig("pca.png")

iris = datasets.load_iris()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)

# the projected components are zero-mean and mutually uncorrelated
print(np.round(X_pca.mean(axis=0), decimals=5))
print(np.round(X_pca.std(axis=0), decimals=5))
print(np.round(np.corrcoef(X_pca.T), decimals=5))

plot_2D(X_pca, iris.target, iris.target_names)

ps:
pl.scatter
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter

ps:
colors = cycle('rgbcmykw')
b: blue
g: green
r: red
c: cyan
m: magenta
y: yellow
k: black
w: white

ps:
matplotlib.colors
http://matplotlib.org/api/colors_api.html

……………………………………………………………………………………… 

Feature Selection

There are many feature-selection methods; three common ones are:
Univariate feature selection 
Recursive feature elimination 
Feature selection using SelectFromModel

Below we mainly introduce univariate feature selection. Its principle is to compute a statistical score for each feature independently, and then judge each feature's importance according to that score.
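The other two methods listed above are not covered in detail here; for contrast, here is a minimal sketch of both on the iris data (the LogisticRegression estimator is an illustrative choice, not prescribed by this note):

from sklearn import datasets
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Recursive feature elimination: repeatedly fit the model and
# drop the weakest feature until only 2 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
print(rfe.fit(X, y).support_)        # boolean mask of kept features

# SelectFromModel: keep features whose model weights exceed a threshold
sfm = SelectFromModel(LogisticRegression(max_iter=1000))
print(sfm.fit(X, y).get_support())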

……………………………………………………………………………………


Univariate feature selection

The main variants are the following:
SelectKBest: keep the top k features by score
SelectPercentile: keep the top n% of features by score
 

SelectPercentile(score_func, percentile)
score_func : callable, ex:f_classif
Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues).
percentile : int, optional, default=10
Percent of features to keep.

SelectKBest(score_func,k)
score_func : callable, ex:chi2
Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues).
k : int or "all", optional, default=10
Number of top features to select. The “all” option bypasses selection, for use in a parameter search.

Choosing score_func:
 For regression problems, use f_regression
 For classification problems, use chi2 or f_classif
ex:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)
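Extending that two-line snippet into a runnable sketch on the iris data (the percentile value is illustrative):

from sklearn import datasets
from sklearn.feature_selection import SelectPercentile, f_classif

iris = datasets.load_iris()
# keep the top 50% of features, ranked by the ANOVA F-value
selector = SelectPercentile(f_classif, percentile=50)
X_new = selector.fit_transform(iris.data, iris.target)
print(X_new.shape)   # (150, 2): 2 of the 4 original features survive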


example by SelectKBest

ex:
Feature selection for classification
#vi univariate.py

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=3)
result = selector.fit(X, y)
print(result.scores_)
x2 = selector.transform(X)  # keep the 3 best-scoring features from the original data as the new dataset
print(x2)

#python univariate.py
[ 10.81782088 3.59449902 116.16984746 67.24482759]
[[ 5.1 1.4 0.2]
[ 4.9 1.4 0.2]
[ 4.7 1.3 0.2]
[ 4.6 1.5 0.2]
…omit…
Explanation
The 3rd column has the largest score, so it is the most influential feature
The 2nd column has the smallest score, so it is the least influential
[ 10.81782088 3.59449902 116.16984746 67.24482759]
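To see exactly which columns were kept, SelectKBest's get_support() returns a boolean mask (or column indices) over the original features; a minimal sketch repeating the fit above:

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
selector = SelectKBest(chi2, k=3).fit(iris.data, iris.target)
# the 2nd column (index 1), which has the lowest score, is dropped
print(selector.get_support())                # [ True False  True  True]
print(selector.get_support(indices=True))    # [0 2 3]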

ex:
Feature selection for regression
#vi univariate.py

from sklearn import datasets
boston = datasets.load_boston()   # note: load_boston was removed in scikit-learn 1.2
X = boston.data
y = boston.target
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(f_regression, k=2)
result = selector.fit(X, y)
print(result.scores_)
x2 = selector.transform(X)   # keep the 2 best-scoring features
print(x2)
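To interpret the scores, it helps to pair them with the dataset's feature names; a minimal sketch (again requiring a scikit-learn version older than 1.2 for load_boston):

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression

boston = datasets.load_boston()   # removed in scikit-learn 1.2
selector = SelectKBest(f_regression, k=2).fit(boston.data, boston.target)
# pair each original feature name with its F-score
for name, score in zip(boston.feature_names, selector.scores_):
    print(name, round(score, 2))
print(boston.feature_names[selector.get_support()])   # the 2 kept features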



refer
https://machine-learning-python.kspax.io/Feature_Selection/intro.html
http://www.cnblogs.com/jasonfreak/p/5448385.html
http://sklearn.lzjqsdd.com/modules/feature_selection.html