Python pandas access

控制dataframe

……………………………………

外部資料讀取與寫入

讀檔
pd.read_csv()
ex:
#cat data.csv
date,value
2016,100
2015,90
2014,50
>>>data=pd.read_csv(‘data.csv’)
>>>data.date
0 2016
1 2015
2 2014
ex:
pd.read_csv(‘ch09/stock_px.csv’, parse_dates=True, index_col=0)
ex:
temp_df = pd.read_csv(‘file.csv’, header=None, names=[‘FileID’, ‘CustomerID’, ‘QueryTS’, ‘ProductID’])


儲存
pd.to_csv()
ex:
pd.to_csv(“analysis_result.csv”)

……………………………………

資料轉換

轉換
pandas to series
.series()
ex:
ts = pd.Series(df[‘Value’].values, index=df[‘Date’])

pandas to numpy
.values
ex:
npdata=df.values

pandas to list
.to_list()
ex:
listdata=df.to_list()

dict to pandas
ex:
groups = [“Web”, “Dev”, “Cloud”, “Data”, “Security”, “group_p”]
groupvalue = [5, 9, 1, 4, 6, 7]
data_dict = {“groups”: groups,”value”: groupvalue}
data_df = pd.DataFrame(data_dict)

list to pandas
ex:
listfieldlen = [59, 9, 19, 14, 6, 77]
df_fieldcount = pd.DataFrame()
df_fieldcount[‘fieldcount’]=listfieldlen
ex:
listfieldlen = [59, 9, 19, 14, 6, 77]
df_listfieldlen = pd.DataFrame(listfieldlen)

refer
https://ithelp.ithome.com.tw/articles/10185922

……………………….

資料操作


設定欄位型別
.astype( < type> )
ex:
df[“Department”] =df[“Department”].astype(“category”)

刪除指定的欄位
.drop( < field name > )
ex:
data=data.drop(data.columns[0],axis=1)
data=data.drop(“column name”,axis=1)
ex:
#刪除指定欄位
> datatrain.drop([‘agenttime’],axis=1,inplace=True)


分割欄位資料
.str.split( < char> )
ex:
data3[‘URL’].str.split(‘?’, expand=True)

合併欄位
#透過合併欄位建立新欄位
ex:
> datatrain[‘ip’] = datatrain[‘ipv4′] + datatrain[‘ipv6′]

以累積方式合併資料
.append()
ex:
datatrain=train1.append(train2, ignore_index=True)


水平合併
.concat
ex:
datanew = pd.concat([data1,data2],axis=1, join_axes=[data1.index])

… 

將空值(NaN)補0
.fillna()
ex:
df.fillna(0)
ex:
# 將空值補滿
> datatrain[‘account’].fillna(“null”, inplace=True)

取代
.replace()
此方法需先透過dictionary將字串變成代號
ex:
> dict_content={‘Tcp’:0,’TCP’:0,’udp’:1,’UDP’:1}
> datatrain[‘protocol’].replace(dict_content,regex=True)

將資料前後空白刪除
df.str.strip()

轉換時間格式
to_datetime( < time>, format=” < time format>”)
ex:
pd.to_datetime(“6 Mar 2019 00:54:40”,format=’%d %b %Y %H:%M:%S’)
>2019-03-06 00:54:40