Python Libraries

2020-03-23 · about 613 words · estimated reading time 2 minutes · filed under Machine Learning

numpy

Data type conversion: astype

import numpy as np

arr = np.array([1, 2, 3])
arr.dtype  # inspect the data type
float_arr = arr.astype(np.float32)  # returns a new array with the requested dtype
float_arr.dtype
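
A small sketch of astype in both directions, assuming only NumPy; the array values are made up for illustration. It shows that astype returns a new array and leaves the original untouched, and that converting floats to an integer type drops the fractional part rather than rounding.

```python
import numpy as np

arr = np.array([1, 2, 3])
float_arr = arr.astype(np.float32)   # new array; arr keeps its integer dtype

frac = np.array([1.7, -1.7, 2.5])
int_arr = frac.astype(np.int32)      # fractional parts are dropped (truncation toward zero)

print(float_arr.dtype)   # float32
print(int_arr.tolist())  # [1, -1, 2]
```

If rounding is what you want, round first (for example with np.round) and convert afterwards.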

pandas

read_csv

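pd.read_csv(path, header=, names=)
header=0 means the first row is used as the column names. If the file has no header row, pass header=None and supply the column names through the names list.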
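
The two cases can be sketched as follows. The CSV text and the column names "a" and "b" are hypothetical, and io.StringIO stands in for a real file path so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV data, used here instead of a file path.
with_header = "a,b\n1,2\n3,4\n"
without_header = "1,2\n3,4\n"

# header=0: the first row supplies the column names.
df1 = pd.read_csv(io.StringIO(with_header), header=0)

# No header row: header=None plus an explicit list of column names.
df2 = pd.read_csv(io.StringIO(without_header), header=None, names=["a", "b"])

print(df1.equals(df2))  # True
```

Both calls produce the same two-row frame, which is the point: names replaces the header row that the second file lacks.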

Indexing data

Rows: data.loc[row label], data.iloc[row position]. Columns: data[column name].
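
A minimal sketch of the three access forms on a hypothetical frame (the labels and values are invented for illustration):

```python
import pandas as pd

# Hypothetical example frame with string row labels.
data = pd.DataFrame(
    {"height": [170, 180, 165], "weight": [60, 80, 55]},
    index=["alice", "bob", "carol"],
)

row_by_label = data.loc["bob"]      # row selected by index label
row_by_position = data.iloc[1]      # row selected by integer position (0-based)
column = data["height"]             # column selected by column name

print(row_by_label.equals(row_by_position))  # True
print(column.tolist())                       # [170, 180, 165]
```

Here "bob" is the second row, so loc["bob"] and iloc[1] return the same Series.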

Plotting a DataFrame

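The pandas DataFrame is a two-dimensional table, much like an Excel sheet. A dictionary that maps names to equal-length lists has the same tabular shape, so pandas makes it very easy to plot dictionary data:

pd.DataFrame(data_dict).plot()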
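
To make the dictionary-to-table mapping concrete, here is a small sketch with a hypothetical training-history dictionary (the metric names and values are invented). Each key becomes a column and each list position becomes a row.

```python
import pandas as pd

# Hypothetical training-history dictionary: one list of per-epoch values per metric.
history = {"loss": [0.9, 0.5, 0.3], "accuracy": [0.6, 0.8, 0.9]}

df = pd.DataFrame(history)
print(df.columns.tolist())  # ['loss', 'accuracy']
print(df.shape)             # (3, 2)

# df.plot() would then draw one line per column against the row index
# (this needs matplotlib to be installed).
```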

pandas-profiling

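import pandas_profiling

# train_data is a DataFrame that has already been loaded; config_file selects the report configuration
train_data_profiling = pandas_profiling.ProfileReport(train_data, config_file="D:/document/Python/jupyter/pandas-profiling-config/config_minimal.yaml")
train_data_profiling.to_file("report.html")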

Configuration files to choose from: config_minimal.yaml, config_default.yaml, config_explorative.yaml

sklearn

Splitting a dataset

Randomly split a dataset by a given ratio

from sklearn.model_selection import train_test_split
# By default the data is split 3:1 (train:test); change this with the test_size parameter, which defaults to 0.25
x_train_all, x_test, y_train_all, y_test = train_test_split(housing.data, housing.target, random_state=7, test_size=0.25)
x_train, x_valid, y_train, y_valid = train_test_split(x_train_all, y_train_all, random_state=11)
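
The sizes produced by the two-stage split can be checked with a small sketch. The arrays X and y below are hypothetical stand-ins for housing.data and housing.target, sized so that the 0.25 fractions divide evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 160 samples with 8 features each.
X = np.arange(160 * 8).reshape(160, 8)
y = np.arange(160)

x_train_all, x_test, y_train_all, y_test = train_test_split(X, y, random_state=7, test_size=0.25)
x_train, x_valid, y_train, y_valid = train_test_split(x_train_all, y_train_all, random_state=11)

print(len(x_train), len(x_valid), len(x_test))  # 90 30 40
```

The first call holds out 25% of the data for testing; the second holds out 25% of the remainder for validation, so training ends up with 56.25% of the original samples. Features and targets are shuffled together, so each row of x_train still matches its entry in y_train.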

Standardizing data

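import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform first fits, computing the statistics of the data (for StandardScaler,
# the mean and standard deviation of each feature column), and then transforms,
# applying the standardization.
x_train_scaled = scaler.fit_transform(
    # The input is an integer array, so convert it to float first.
    # x_train has shape [1000, 28, 28], but fit_transform expects a 2-D array of shape
    # (n_samples, n_features). Reshape to n rows by 1 column so that all pixels share one
    # mean and one standard deviation. reshape(1, -1) would treat every pixel as its own
    # feature with a single sample and return all zeros.
    x_train.astype(np.float32).reshape(-1, 1)
).reshape(-1, 28, 28)  # finally, restore the original three-dimensional shape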

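# Standardize the test set with the mean and variance of the training set. fit_transform
# runs fit and then transform, and the statistics computed by fit are stored in the scaler,
# so the test set only needs transform.
x_test_scaled = scaler.transform(x_test.astype(np.float32).reshape(-1, 1)).reshape(-1, 28, 28)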
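
A self-contained sketch of this kind of standardization on synthetic data, assuming random integer arrays as stand-ins for x_train and x_test (the sizes 100 and 20 are arbitrary). It flattens each array to a single column so that one mean and one standard deviation are learned from the training pixels, and contrasts that with a single-row reshape.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for image data: integer pixels in [0, 255].
x_train = rng.integers(0, 256, size=(100, 28, 28))
x_test = rng.integers(0, 256, size=(20, 28, 28))

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(
    x_train.astype(np.float32).reshape(-1, 1)
).reshape(-1, 28, 28)
x_test_scaled = scaler.transform(
    x_test.astype(np.float32).reshape(-1, 1)
).reshape(-1, 28, 28)

# One mean and one standard deviation were learned, from the training pixels only.
print(scaler.mean_.shape)  # (1,)

# For contrast: a single row makes every pixel its own feature with one sample,
# so each value equals its own mean and the result is all zeros.
zeros = StandardScaler().fit_transform(x_train.astype(np.float32).reshape(1, -1))
print(np.all(zeros == 0))  # True
```

After scaling, the training array has mean close to 0 and standard deviation close to 1. The test array is only approximately standardized, because it is shifted and scaled by the training statistics rather than its own.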