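Here I use the datasets bundled with the library for the demonstration. For your own work, simply replace them with the dataset you need to impute, as a numpy array.
Before loading the bundled data, you need to download it and place it in the directory where the library is installed:
Download the data. Linux users can use wget; Windows users can copy the links below and download the files manually: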
wget https://github.com/HCMY/ycimpute/raw/master/test_data/boston.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/iris.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/wine.hdf5
The data files are downloaded into your current working directory. Then move them here:
<your Python directory>/site-packages/ycimpute/datasets/
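If wget is not available, the same download can be sketched in Python. This is only a minimal sketch: the helper names `dataset_urls`, `find_datasets_dir`, and `download_all` are hypothetical and not part of ycimpute, and it assumes the datasets directory is the `datasets` folder inside the installed package, as described above.

```python
import importlib.util
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://github.com/HCMY/ycimpute/raw/master/test_data/"
FILES = ("boston.hdf5", "iris.hdf5", "wine.hdf5")


def dataset_urls():
    """Build the three download URLs listed above."""
    return [BASE_URL + name for name in FILES]


def find_datasets_dir(package="ycimpute"):
    """Locate <package>/datasets inside the installed package (assumed layout)."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ImportError(f"package {package!r} is not installed")
    return Path(list(spec.submodule_search_locations)[0]) / "datasets"


def download_all(target_dir):
    """Download every dataset file into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for url in dataset_urls():
        dest = target_dir / url.rsplit("/", 1)[-1]
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)


# Typical use (needs network access and an installed ycimpute):
# download_all(find_datasets_dir())
```

Writing into site-packages may require the same permissions you used to install the library.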
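First, let's see how to load the bundled data:
Data is loaded with the loader load_data. Each loading function returns two arrays: the data with missing values and the corresponding complete data, in that order.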
from ycimpute.datasets import load_data
boston_missing, boston_full = load_data.load_boston()
print(boston_missing)
print(boston_full)
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 nan ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 nan 7.07000000e+00 ..., nan
3.92830000e+02 nan]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 7.07000000e+00 ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 0.00000000e+00 7.07000000e+00 ..., 1.78000000e+01
3.92830000e+02 4.03000000e+00]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
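Before imputing, it can help to check how much of the data is actually missing. The following is a minimal sketch using plain numpy; `missing_summary` is a hypothetical helper, and the small array is only a stand-in for `boston_missing`.

```python
import numpy as np


def missing_summary(X):
    """Return the total NaN count and the fraction of NaNs in each column."""
    mask = np.isnan(X)
    return int(mask.sum()), mask.mean(axis=0)


# Toy stand-in for boston_missing
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, np.nan, 9.0],
              [1.0, 2.0, 3.0]])

total, per_column = missing_summary(X)
print(total)       # total number of missing entries
print(per_column)  # fraction of missing entries per column
```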
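Now let's see how to pick an imputation method and apply it:
Here we take Iterforest as an example:
Use the imputer class IterImput(), pass any required parameters to its constructor, and call complete() to perform the imputation.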
from ycimpute.imputer import IterImput
X = boston_missing
complete_X = IterImput().complete(X)
print(complete_X)
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 5.46000000e+00 ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 1.75000000e+01 7.07000000e+00 ..., 1.82000000e+01
3.92830000e+02 6.28300000e+00]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
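A quick sanity check on any imputed array is that no NaN remains and that the observed entries are unchanged, which is consistent with the output printed above. The sketch below assumes that behaviour; `check_imputation` is a hypothetical helper, and `mean_impute` is a simple column-mean stand-in so the example runs without ycimpute. It is not the Iterforest method.

```python
import numpy as np


def check_imputation(X_missing, X_filled):
    """Raise AssertionError if X_filled is not a valid completion of X_missing."""
    observed = ~np.isnan(X_missing)
    assert X_filled.shape == X_missing.shape, "shape changed"
    assert not np.isnan(X_filled).any(), "NaNs remain after imputation"
    assert np.allclose(X_filled[observed], X_missing[observed]), \
        "observed values were modified"
    return True


def mean_impute(X):
    """Stand-in imputer: replace each NaN with the mean of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 8.0]])
filled = mean_impute(X)
print(check_imputation(X, filled))
```

With ycimpute installed, the same check could be applied as `check_imputation(boston_missing, complete_X)`.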
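How do we judge the quality of an imputation?
Strictly speaking, in a real scenario there is no quick way to tell whether an imputation is good, unless you know the true values and can test against them, or you look at how the completed data performs in your model.
We take the first approach and run the experiment with a complete dataset and a corresponding copy that contains missing values: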
from ycimpute.utils import evaluate
from ycimpute.utils import config
from ycimpute.utils.tools import Solver
# Get the original values at the missing positions
solver = Solver()
mask_all = solver.masker(boston_missing)[config.all]
missing_index = evaluate.get_missing_index(mask_all)
original_arr = boston_full[missing_index]
# Get the imputed values at the same positions
iterforest_filled_arr = complete_X[missing_index]
# Evaluate
rmse_iterforest_score = evaluate.RMSE(original_arr, iterforest_filled_arr)
print(rmse_iterforest_score)
23.090788716
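The same experiment can be sketched with numpy alone: hide some entries of a complete array, impute them, and compare only the hidden positions. The names `rmse` and `mask_at_random` are hypothetical, the formula is the textbook root mean squared error (the exact formula used by `evaluate.RMSE` is not shown here and may differ), and the column-mean imputation is only a stand-in for a real method.

```python
import numpy as np


def rmse(a, b):
    """Textbook root mean squared error between two equally shaped arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def mask_at_random(X_full, missing_rate, seed=0):
    """Return a copy of X_full with random entries set to NaN, plus the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X_full.shape) < missing_rate
    X_missing = X_full.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask


# Synthetic complete data standing in for boston_full
X_full = np.random.default_rng(42).normal(size=(50, 4))
X_missing, mask = mask_at_random(X_full, missing_rate=0.2)

# Stand-in imputation: fill each NaN with its column mean
X_filled = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)

# Score only the positions that were hidden
score = rmse(X_full[mask], X_filled[mask])
print(score)
```

Scoring only the masked positions matters: including the observed entries, which are reproduced exactly, would make every method look better than it is.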
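To help you choose a better method, the library provides an interface that reports the results of all imputation methods at once. For now it only supports the RMSE metric, which applies to continuous missing values.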
from ycimpute.utils.shower import show
import pandas as pd
boston_result = show.analysiser(missing_X=boston_missing, original_X=boston_full)
boston_result = pd.DataFrame.from_dict(boston_result, orient='index')
print(boston_result)
0
rmse_em_score 30.405419
rmse_median_score 57.616702
rmse_mean_score 52.154860
rmse_knn_score 40.944330
rmse_zero_score 159.534384
rmse_iterforest_score 23.741541
rmse_mice_score 27.914184
rmse_min_score 127.874980
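Since a lower RMSE means the imputed values are closer to the true ones, sorting the scores makes the comparison easier to read. The sketch below simply re-enters the numbers printed above into a pandas Series; the variable names are illustrative.

```python
import pandas as pd

# Scores copied from the output above
scores = {
    "rmse_em_score": 30.405419,
    "rmse_median_score": 57.616702,
    "rmse_mean_score": 52.154860,
    "rmse_knn_score": 40.944330,
    "rmse_zero_score": 159.534384,
    "rmse_iterforest_score": 23.741541,
    "rmse_mice_score": 27.914184,
    "rmse_min_score": 127.874980,
}

# Sort ascending so the best (lowest RMSE) method comes first
ranking = pd.Series(scores, name="rmse").sort_values()
best_method = ranking.index[0]

print(ranking)
print(best_method)
```

On this run, Iterforest has the lowest error, followed by MICE and EM, while filling with zeros or the column minimum performs worst.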