Data Preprocessing | 牛的大腦

data preprocessing
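Common tasks:
data cleaning: fix dirty data
data integration: combine data from multiple sources
data transformation: normalize the data, i.e. rescale all values into a fixed range
data reduction: remove duplicate data, or sample the data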
data discretization and concept hierarchy generation

……………………………………………………………………………………………………..

data cleaning
data cleaning is one of the three biggest problems in data warehousing, and the most important of them

data in the real world is dirty
because the data may contain
　incomplete data
　noisy data
　inconsistent data
　duplicate records
ps:
no quality data,no quality mining results

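To deal with the dirty data above, the main tasks are
handle missing data: some values are missing, so records are incomplete
handle noisy data: some values are wrong, possibly corrupted during file conversion or data entry
detect data discrepancy

The main ways to detect data discrepancy are
　use metadata: check each attribute's domain, range, dependency, and distribution
　check for field overloading
　check the uniqueness rule, consecutive rule, and null rule
　use commercial tools for data scrubbing and data auditing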
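
As a rough illustration of the uniqueness, consecutive and null rules, the sketch below checks a small list of records. The field names (`id`, `srcaddr`) and the sample rows are made up for this example; real tools apply many more rules.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "srcaddr": "10.1.0.5"},
    {"id": 2, "srcaddr": None},        # violates the null rule
    {"id": 2, "srcaddr": "10.1.0.7"},  # violates the uniqueness rule
    {"id": 5, "srcaddr": "10.2.0.9"},  # leaves a gap: 3 and 4 are missing
]

def check_unique(rows, field):
    """Uniqueness rule: return values that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[field]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)

def check_consecutive(rows, field):
    """Consecutive rule: return integers missing between min and max."""
    values = {row[field] for row in rows}
    return [v for v in range(min(values), max(values) + 1) if v not in values]

def check_null(rows, field):
    """Null rule: return indexes of rows where the field is empty."""
    return [i for i, row in enumerate(rows) if row[field] in (None, "")]

duplicate_ids = check_unique(records, "id")
missing_ids = check_consecutive(records, "id")
null_rows = check_null(records, "srcaddr")
print(duplicate_ids, missing_ids, null_rows)
```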

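The main ways to handle missing data are
ignore the tuple: drop records with missing values
fill in manually: enter the missing values by hand
fill in automatically: fill values according to a given rule; common rules are
　fill in "unknown", treating it as one more value
　fill in the attribute mean
　　by all: replace missing data with the mean of that attribute over all records
　　by class: split the data into classes, and replace missing data in a class with that class's attribute mean
　fill in a value inferred by a Bayesian formula or a decision tree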
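
The "by all" and "by class" rules can be sketched as follows. The class labels and byte counts are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical rows: (class label, byte count); None marks a missing value.
rows = [("web", 100), ("web", None), ("web", 300),
        ("p2p", 1000), ("p2p", None), ("p2p", 3000)]

def fill_by_all(data):
    """Replace missing values with the mean of all known values."""
    overall = mean(v for _, v in data if v is not None)
    return [(c, overall if v is None else v) for c, v in data]

def fill_by_class(data):
    """Replace missing values with the mean of the same class."""
    class_means = {}
    for label in {c for c, _ in data}:
        class_means[label] = mean(
            v for c, v in data if c == label and v is not None)
    return [(c, class_means[c] if v is None else v) for c, v in data]

filled_all = fill_by_all(rows)
filled_class = fill_by_class(rows)
print(filled_all)
print(filled_class)
```

With these rows the overall mean is pulled up by the p2p records, which is why filling by class usually gives more plausible values when the classes differ a lot.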

The main ways to handle noisy data are
binning
regression
clustering: find outlier values and remove them

ps:
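binning
1 sort data and partition into bins
When partitioning into M bins, there are two approaches
equal-depth(frequency): each of the M bins holds the same number of values (numeric widths may differ)
equal-width(distance): each of the M bins covers the same numeric width (the number of values may differ)
2 smooth
replace the values in each bin using the bin mean, median, boundaries, etc.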
ex:
data set:4,8,9,15,21,21,24,25,26,28,29,34
1 partition into bins
to split into 3 bins using equal-width
each bin needs a numeric width of (34-4+1)/3≈10.3, giving the ranges [4,14.3), [14.3,24.7), [24.7,35)
　bin1:4,8,9
　bin2:15,21,21,24
　bin3:25,26,28,29,34
2 smooth
smooth by mean
　bin1:7,7,7
　bin2:20.25,20.25,20.25,20.25
　bin3:28.4,28.4,28.4,28.4,28.4
smooth by median
　bin1:8,8,8
　bin2:21,21,21,21
　bin3:28,28,28,28,28
smooth by boundaries
　bin1:4,9,9
　bin2:15,24,24,24
　bin3:25,25,25,25,34
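
The binning steps can be sketched in plain Python. The equal-width function uses width = (max - min + 1) / k, which reproduces the bins of the worked example; other texts define the width as (max - min) / k, so treat that choice as an assumption. Ties in boundary smoothing go to the lower boundary here, which is also an assumption.

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_width_bins(values, k):
    """Partition sorted values into k bins of equal numeric width."""
    values = sorted(values)
    low = values[0]
    width = (values[-1] - low + 1) / k  # convention assumed from the example
    bins = [[] for _ in range(k)]
    for v in values:
        bins[min(int((v - low) / width), k - 1)].append(v)
    return bins

def equal_depth_bins(values, k):
    """Partition sorted values into k bins holding the same number of items."""
    values = sorted(values)
    size = len(values) // k  # assumes len(values) is divisible by k
    return [values[i * size:(i + 1) * size] for i in range(k)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the nearer of its bin's min or max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

bins = equal_width_bins(data, 3)
print(bins)
print(equal_depth_bins(data, 3))
print(smooth_by_mean(bins))
print(smooth_by_boundaries(bins))
```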

……………………………………………

data integration
combines data from multiple sources into a coherent store

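Problems that may come up during integration
entity identification: values with different names that mean the same thing, ex: tw=taiwan
different units: convert to the same unit during integration, ex: byte vs bit
redundant data: common cases are
　object identification: the same attribute has different names in different sources, ex: field srcaddr = field source
　derivable data: an attribute's values can be derived from other attributes, ex: field bpp = field byte / field packet
ps:
redundant data can be detected by correlation analysis
for numerical data, use the correlation coefficient
for categorical data, use the chi-square test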
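
For numerical attributes, a minimal sketch of the Pearson correlation coefficient is shown below. The two columns are hypothetical: the second is just the first converted to another unit, so it is fully derivable and the coefficient comes out at (approximately) 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns: kbytes is derivable from bytes (a redundant attribute).
bytes_col = [1200, 4800, 9600, 15000, 640]
kbytes_col = [b / 1024 for b in bytes_col]
r = pearson(bytes_col, kbytes_col)
print(round(r, 6))
```

A coefficient near +1 or -1 suggests one of the two attributes may be redundant; a value near 0 only says there is no linear relationship.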

ps:
chi-square test
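χ² = Σ ( (observed-expected)^2 / expected )
the larger χ² is, the stronger the evidence that the two attributes are related
ex:
a hypothesis claims that p2p use and virus infection are unrelated (H0)
to test this hypothesis, a sample was surveyed; the observed counts (expected counts in parentheses) are as follows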

                uses p2p     no p2p        sum(row)
infected        250 (90)     200 (360)     450
not infected    50 (210)     1000 (840)    1050
sum(column)     300          1200          1500

ps:
expected(90)=450*300/1500
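χ²=(250-90)^2/90+(200-360)^2/360+(50-210)^2/210+(1000-840)^2/840=507.94
to reject H0 (overturn the hypothesis' claim), χ² must exceed the critical value for the chosen significance level
for a significance level of 0.001 with 1 degree of freedom, the critical value is 10.828
ps:
degrees of freedom=(row-1)(column-1)=(2-1)(2-1)=1
χ²>10.828, so H0 is rejected; in other words, this sample indicates that p2p use and infection are related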
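
The same calculation can be checked with a few lines of Python. This is only a sketch for a contingency table of counts; the critical value 10.828 is copied from the text rather than computed.

```python
# Observed counts from the table: rows = infected / not infected,
# columns = uses p2p / does not use p2p.
observed = [[250, 200],
            [50, 1000]]

def chi_square(table):
    """Return (chi2, expected) for a contingency table of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # expected count = row total * column total / grand total
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return chi2, expected

chi2, expected = chi_square(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_0_001_DF1 = 10.828  # critical value quoted in the text for df = 1
reject_h0 = chi2 > CRITICAL_0_001_DF1
print(round(chi2, 2), dof, reject_h0)
```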

……………………………………

data transformation
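normalize the data, i.e. rescale all values into a fixed range

Common tasks are
aggregation/summary: building a data cube usually includes this step
　ex: daily sales are aggregated into monthly or yearly sales
　ex: daily traffic data is aggregated into weekly or monthly data
　ex: flow data is aggregated into flowg data, and then into ip data
generalization: concept hierarchy climbing
　ex: categorical, e.g. an address can be broken down into city, street, and house number
　ex: numerical, e.g. age can be grouped into young, middle-aged, senior
　ex: ip, the lan the ip belongs to, the country that lan is in
normalization: map values into a range; common methods are
　min-max normalization: ((v-min)/(max-min)) * (new_max-new_min)+new_min
　z-score normalization: convert values to z-scores statistically, (v-mean)/s, where s is the standard deviation
　normalization by decimal scaling (not recommended)
ps: min-max normalization does not help find outliers, because every value, outliers included, ends up within 0-1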
ex:
byte min=12000,byte max=98000,
xi byte=73600, use min-max normalization to map it into the 0-1 range
new value=((73600-12000)/(98000-12000)) * (1-0)+0=0.71627907
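
Both formulas are short enough to sketch directly. The z-score version below assumes s is the sample standard deviation (`statistics.stdev`); the text does not say whether the sample or population form is meant. The three-value `sample` list is hypothetical.

```python
from statistics import mean, stdev

def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [v_min, v_max] to [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_scores(values):
    """Z-score normalization, assuming the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

scaled = min_max(73600, 12000, 98000)
print(round(scaled, 8))

sample = [12000, 73600, 98000]  # hypothetical byte counts
print([round(z, 3) for z in z_scores(sample)])
```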

……………………………………………

data reduction
remove duplicate data, or sample the data

Common strategies are
data cube aggregation
dimensionality reduction
data compression: usually used for multimedia
numerosity reduction
discretization and concept hierarchy generation: convert raw data to a higher conceptual level

dimensionality reduction
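goal: remove unimportant dimensions, finding the smallest set of attributes that still meets the objective
common methods include
decision-tree induction
heuristic feature selection method
principal component analysis: treats numeric data as vectors; used when there are too many dimensions; from n dimensions it derives k (k<=n) new dimensions, the principal components, that best represent the data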

numerosity reduction
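goal: reduce the volume of data
parametric method:
　includes linear regression, multiple regression, log-linear model
non-parametric method: includes
　histograms: partition the data into buckets, and represent each bucket by its average
　　common partitioning methods include: equal-width, equal-frequency, v-optimal (partition by variance), maxdiff
　clustering: represent the data by clusters, and exclude outliers
　sampling: select representative data
　　SRSWOR (simple random sample without replacement): a drawn record is not put back, so the finite population shrinks as more samples are drawn
　　SRSWR (simple random sample with replacement): a drawn record is put back, so the finite population stays the same
　　stratified sampling: split into strata first, then sample from each stratum in proportion to its size
　　cluster sampling: split into clusters first, then take some whole clusters as the sample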
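
The four sampling schemes can be sketched with Python's standard `random` module. The population of 100 record ids and the "web"/"p2p" grouping are hypothetical, and the same grouping is reused for both the stratified and the cluster example only to keep the sketch short.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 record ids

# SRSWOR: each record can be drawn at most once.
srswor = random.sample(population, 10)

# SRSWR: a drawn record is put back, so it may be drawn again.
srswr = random.choices(population, k=10)

def stratified_sample(groups, fraction):
    """Draw the same fraction from every stratum (without replacement)."""
    picked = []
    for members in groups.values():
        picked.extend(random.sample(members, round(len(members) * fraction)))
    return picked

def cluster_sample(groups, n_clusters):
    """Pick whole clusters at random and keep all of their members."""
    chosen = random.sample(sorted(groups), n_clusters)
    return [m for name in chosen for m in groups[name]]

# Hypothetical grouping: 80 "web" records and 20 "p2p" records.
groups = {"web": population[:80], "p2p": population[80:]}
strat = stratified_sample(groups, 0.1)
clus = cluster_sample(groups, 1)
print(len(srswor), len(srswr), len(strat), len(clus))
```

Stratified sampling keeps the 80/20 proportion in the sample (8 web, 2 p2p), while cluster sampling keeps one whole group and drops the other.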

……………………….

data discretization and concept hierarchy generation
concept hierarchy
　replace low level data with high level data, ex: represent age values with 3 values {old, young, child}
discretization
　divide continuous data into n intervals
　use: data reduction, or preparation for further analysis
　depending on whether class-labelled training data is used, it is either supervised or unsupervised
　it can also be split(top-down) or merge(bottom-up)
ps:
3 attribute types
nominal: values from an unordered set, ex: color, protocol
ordinal: values from an ordered set, ex: virus rank
continuous: real number, ex: byte

For numerical data
top-down split
　binning: unsupervised
　histogram analysis: unsupervised
　natural partitioning/3-4-5 rule: unsupervised
　entropy-based discretization: pick a candidate split point, compute the entropy of the values on its left and right to judge whether it is a good split, and repeat until the best split point is found; supervised
bottom-up merge
　chi-square test analysis (e.g. ChiMerge): supervised, because χ² is computed against the class labels
top-down split or bottom-up merge
　clustering analysis: unsupervised
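
A minimal sketch of one step of entropy-based discretization: try every boundary between adjacent values and keep the one with the lowest weighted class entropy. The byte counts and the normal/attack labels are hypothetical; a full discretizer would apply this recursively to each side with a stopping criterion, which is left out here.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * log2(1 / p) for p in probs)

def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical training data: byte counts labelled normal / attack.
values = [10, 20, 30, 40, 500, 600, 700]
labels = ["normal"] * 4 + ["attack"] * 3
threshold, score = best_split(values, labels)
print(threshold, score)
```

Because the class labels drive the choice of split point, the method is supervised.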

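For categorical data
by schema level
ex: if the address field consists of country, city, street, the hierarchy follows from the schema: (low-level) street < city < country (high-level)
by explicit data grouping
ex: { 10.1.0.0/16 , 10.2.0.0/16 } belong to the sales network, { 172.16.1.0/24 , 172.16.2.0/24 } belong to the research network
by counting the distinct values of each dimension
dimensions with more distinct values usually go at the low level, and dimensions with fewer distinct values at the high level
ex: a network location is described by city and ip, with 30 and 1000 distinct values respectively, so city is high-level and ip is low-level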
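
The distinct-value heuristic can be sketched as follows. The rows and attribute names are hypothetical, and the ordering is only a suggestion: an attribute with few distinct values is not always the higher level.

```python
# Hypothetical rows; a real table would have far more records.
rows = [
    {"country": "tw", "city": "taipei", "ip": "10.1.0.1"},
    {"country": "tw", "city": "taipei", "ip": "10.1.0.2"},
    {"country": "tw", "city": "tainan", "ip": "10.2.0.1"},
    {"country": "jp", "city": "tokyo", "ip": "172.16.1.1"},
    {"country": "jp", "city": "tokyo", "ip": "172.16.1.2"},
    {"country": "jp", "city": "osaka", "ip": "172.16.2.1"},
]

def hierarchy_by_distinct_count(records, attributes):
    """Order attributes from low level (most distinct values) to high level."""
    counts = {a: len({r[a] for r in records}) for a in attributes}
    order = sorted(attributes, key=lambda a: counts[a], reverse=True)
    return order, counts

order, counts = hierarchy_by_distinct_count(rows, ["city", "country", "ip"])
print(" < ".join(order), counts)
```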
