SKLearn Dataset

Datasets come mainly from two sources: the built-in datasets, or datasets loaded from external files.

……………………………………………………………………………………..

Loading the built-in datasets


Loading the datasets module
>>> from sklearn import datasets


The iris dataset
Contains feature data for three different species of iris flowers.
ex:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris.data.shape
(150, 4)
>>> print(iris.data)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
…omit…
>>> print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
ps:
iris.data is the actual feature data,
iris.target is the class labels used for training,
iris.data.shape gives the number of samples and the number of features.
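
The Bunch returned by load_iris also carries feature_names and target_names, which show what the four features and three classes actually are; a quick sketch, continuing the session above:
ex:
>>> print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']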


Predicting disease severity
The diabetes dataset contains data for 442 patients, with ten features collected for each patient.
ex:
>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> print(diabetes.data.shape)
(442, 10)
>>> print(diabetes.target)
[ 151. 75. 141. 206. 135. 97. 138. 63. 110. 310. 101. 69.
179. 185. 118. 171. 166. 144. 97. 168. 68. 49. 68. 245.
184. 202. 137. 85. 131. 283. 129. 59. 341. 87. 65. 102.
265. 276. 252. 90. 100. 55. 61. 92. 259. 53. 190. 142.
75. 142. 155. 225. 59. 104. 182. 128. 52. 37. 170. 170.
…omit…
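
In newer scikit-learn versions the diabetes Bunch also exposes the names of the ten features; a quick sketch:
ex:
>>> print(diabetes.feature_names)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']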


Character recognition on images
Uses the UCI digits dataset: 1,797 grayscale images of 64 (8×8) pixels each.
ex:
>>> from sklearn import datasets
>>> digits = datasets.load_digits()
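
Each sample is an 8×8 image flattened into 64 features; the Bunch keeps both views, as this quick sketch shows:
ex:
>>> digits.images.shape
(1797, 8, 8)
>>> digits.data.shape
(1797, 64)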


Boston housing prices
Suitable for regression.
ex:
>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> boston.data.shape
(506, 13)
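
A minimal regression sketch on this data (assumes an older scikit-learn, before 1.2, where load_boston still exists; LinearRegression is just one possible choice):
ex:
from sklearn import datasets
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()   # removed in scikit-learn 1.2
model = LinearRegression()
model.fit(boston.data[:-50], boston.target[:-50])            # hold out the last 50 samples
print(model.score(boston.data[-50:], boston.target[-50:]))   # R^2 on the held-out samples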

refer
http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Loading external datasets

Loading external data
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, charset=None, charset_error=None, decode_error='strict', random_state=0)

ex:
from sklearn.datasets import load_files
load_files(container_path)
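
load_files expects one sub-folder per category under container_path, with each file becoming one sample; a sketch with hypothetical folder names:
ex:
# hypothetical layout:
#   txt_data/
#       pos/  -> 1.txt, 2.txt, ...
#       neg/  -> 3.txt, ...
from sklearn.datasets import load_files

bunch = load_files('txt_data', encoding='utf-8')   # sub-folder names become the classes
print(bunch.target_names)                          # e.g. ['neg', 'pos']
print(len(bunch.data), bunch.target[:5])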
ps:
external data can also be read with numpy loadtxt
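
A sketch of the numpy loadtxt route (the file name and column layout here are assumptions: features first, label in the last column):
ex:
import numpy as np

# hypothetical CSV: each row = feature_1, ..., feature_n, label
raw = np.loadtxt('mydata.csv', delimiter=',')
data, target = raw[:, :-1], raw[:, -1]
print(data.shape, target.shape)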



…………………………………….

Data splitting

After loading a dataset, it is usually split into two parts, a training dataset and a testing dataset. Common ways to do the split follow.

Data splitting by slicing
ex:
Use the last 30 samples as test data and the rest as train data.
>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> print(diabetes.data.shape)
(442, 10)
>>> diabetes_train_data = diabetes.data[:-30]
>>> diabetes_train_target = diabetes.target[:-30]
>>> diabetes_test_data = diabetes.data[-30:]
>>> diabetes_test_target = diabetes.target[-30:]
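
A quick check that this leaves 412 training samples and 30 test samples:
ex:
>>> print(diabetes_train_data.shape, diabetes_test_data.shape)
(412, 10) (30, 10)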


Data splitting by train_test_split
1
Import from cross_validation

from sklearn.cross_validation import train_test_split
2
Split

d_train, d_test, t_train, t_test = train_test_split(data, target, test_size=0.2, random_state=0)
test_size sets the fraction of the data used as test data; the rest becomes train data.
random_state fixes the random seed: the same value gives an identical split on every run.
ps: without random_state, the data is shuffled differently each time; in scikit-learn >= 0.18, train_test_split lives in sklearn.model_selection instead of sklearn.cross_validation.

ex:
from sklearn import datasets
from sklearn.cross_validation import train_test_split

iris = datasets.load_iris()
train_data, test_data, train_target, test_target = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
print('train total:', len(train_target), ', target:', sum(train_target))
print('test total:', len(test_target), ', target:', sum(test_target))

Data splitting by ShuffleSplit
1
Import from cross_validation

from sklearn.cross_validation import ShuffleSplit, cross_val_score
2
Split

cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1, random_state=0)
test_size=0.1 sets the fraction of the data used as test data; the rest is train data.
n_iter=3 makes 3 random splits, i.e. 3 rounds of shuffle-split cross-validation (each round reshuffles the data, unlike strict k-fold).
n_samples is the total number of samples.
3
Compute a score for each split with clf

test_scores = cross_val_score(clf, data, target, cv=cv, n_jobs=2)
clf = <any classifier>
print(test_scores)
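
Putting the three steps together on iris, with KNeighborsClassifier standing in for <any classifier> (as noted above, in scikit-learn >= 0.18 these helpers moved to sklearn.model_selection, and n_iter was renamed n_splits; this sketch uses the newer API):
ex:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

iris = datasets.load_iris()
clf = KNeighborsClassifier()                                   # any classifier works here
cv = ShuffleSplit(n_splits=3, test_size=0.1, random_state=0)   # 3 random 90/10 splits
test_scores = cross_val_score(clf, iris.data, iris.target, cv=cv, n_jobs=2)
print(test_scores)   # one accuracy score per split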

 
refer 
http://hhtucode.blogspot.tw/2013/10/python-ml-with-scikit-learn-model.html