There are two main sources of datasets: built-in datasets, or datasets loaded from external sources.
……………………………………………………………………………………..
Loading Built-in Datasets
Loading a dataset
>>> from sklearn import datasets
The iris dataset
Contains feature data for three different species of iris flowers
ex:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris.data.shape
(150, 4)
>>> print(iris.data)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
…omit…
>>> print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
ps
iris.data is the actual feature data,
iris.target is the class labels used for training,
iris.data.shape gives the number of samples and the number of features
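The bunch returned by load_iris also exposes feature_names and target_names, which map the class indices above back to feature and species names; a quick check (added here for illustration):
>>> print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']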
Disease severity prediction
The diabetes dataset contains data for 442 patients, with ten features collected for each patient
ex:
>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> print(diabetes.data.shape)
(442, 10)
>>> print(diabetes.target)
[ 151. 75. 141. 206. 135. 97. 138. 63. 110. 310. 101. 69.
179. 185. 118. 171. 166. 144. 97. 168. 68. 49. 68. 245.
184. 202. 137. 85. 131. 283. 129. 59. 341. 87. 65. 102.
265. 276. 252. 90. 100. 55. 61. 92. 259. 53. 190. 142.
75. 142. 155. 225. 59. 104. 182. 128. 52. 37. 170. 170.
…omit…
Character recognition on images
Uses the UCI digits dataset, which contains 1797 grayscale images of 64 pixels (8×8) each
ex:
>>> from sklearn import datasets
>>> digits = datasets.load_digits()
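A quick look at the shapes makes the 8×8 layout above concrete (a small check added for illustration):
>>> digits.data.shape
(1797, 64)
>>> digits.images.shape
(1797, 8, 8)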
Boston housing prices
Suitable for regression
ex:
>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> boston.data.shape
(506, 13)
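Since this dataset is meant for regression, a minimal sketch of fitting a linear model might look like this (LinearRegression and the 50-sample holdout are my own choices for illustration, not part of the original notes):
>>> from sklearn.linear_model import LinearRegression
>>> reg = LinearRegression()
>>> reg.fit(boston.data[:-50], boston.target[:-50])    # train on all but the last 50 samples
>>> predictions = reg.predict(boston.data[-50:])       # predict prices for the held-out 50 samples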
refer
http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
…
Loading External Datasets
Loading external data
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, charset=None, charset_error=None, decode_error='strict', random_state=0)
ex:
from sklearn.datasets import load_files
load_files(container_path)
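load_files expects one sub-folder per category under container_path and returns a bunch like the built-in datasets. A minimal sketch, assuming a hypothetical folder my_corpus/ with one sub-folder of text files per class (e.g. my_corpus/pos/, my_corpus/neg/):
from sklearn.datasets import load_files
corpus = load_files('my_corpus', encoding='utf-8')
print(corpus.target_names)   # sub-folder names become the class names
print(len(corpus.data))      # raw file contents, one string per file
print(corpus.target[:10])    # integer class labels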
ps:
External data can also be read with numpy loadtxt, as sketched below
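For plain numeric files, a minimal sketch with numpy loadtxt (the file name and column layout are assumptions for illustration):
import numpy as np
raw = np.loadtxt('mydata.csv', delimiter=',')   # hypothetical CSV: feature columns, label column last
data = raw[:, :-1]     # all columns except the last are features
target = raw[:, -1]    # last column is the label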
…………………………………….
Data Splitting
After loading a dataset it is usually split into two parts, a training dataset and a testing dataset. The common ways to split are shown below.
Data splitting by slicing
ex
Use the last 30 samples as test data and the rest as training data
>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> print(diabetes.data.shape)
(442, 10)
>>> diabetes_train_data = diabetes.data[:-30]
>>> diabetes_train_target = diabetes.target[:-30]
>>> diabetes_test_data = diabetes.data[-30:]
>>> diabetes_test_target = diabetes.target[-30:]
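A quick sanity check on the slicing above (the shapes follow from 442 - 30 = 412):
>>> diabetes_train_data.shape
(412, 10)
>>> diabetes_test_data.shape
(30, 10)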
Data splitting with train_test_split
1
Load cross_validation
from sklearn.cross_validation import train_test_split
2
Split
d_train, d_test, t_train, t_test = train_test_split(data, target, test_size=0.2, random_state=0)
test_size sets the proportion of the data used as test data; the rest is training data
random_state fixes the random seed; the same value gives an identical split on every run
ps: if random_state is not set, the data is shuffled differently on every run
ex:
from sklearn import datasets
from sklearn.cross_validation import train_test_split
iris = datasets.load_iris()
train_data, test_data, train_target, test_target = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
print('train total:', len(train_target), ', target:', sum(train_target))
print('test total:', len(test_target), ', target:', sum(test_target))
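With iris (150 samples) and test_size=0.2 this yields 120 training samples and 30 test samples, and because random_state is fixed the same samples end up in each part on every run.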
Data splitting with ShuffleSplit
1
Load cross_validation
from sklearn.cross_validation import ShuffleSplit,cross_val_score
2
Split
cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1, random_state=0)
test_size=0.1 sets the proportion of the data used as test data; the rest is training data
n_iter=3 generates 3 random splits, i.e., 3 rounds of shuffle-split cross-validation
n_samples is the total number of samples in the dataset
3
Compute the score for each split using the classifier clf
test_scores = cross_val_score(clf, data, target, cv=cv, n_jobs=2)
clf = <any classifier>
print(test_scores)
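Putting the three steps together on iris, with SVC standing in for the unspecified classifier (my own choice, purely for illustration):
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit, cross_val_score
iris = datasets.load_iris()
clf = SVC()                                       # any classifier would do here
cv = ShuffleSplit(len(iris.data), n_iter=3, test_size=0.1, random_state=0)
test_scores = cross_val_score(clf, iris.data, iris.target, cv=cv, n_jobs=2)
print(test_scores)                                # one accuracy score per split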
refer
http://hhtucode.blogspot.tw/2013/10/python-ml-with-scikit-learn-model.html