This walkthrough uses the datasets bundled with the library. To use it on your own data, simply substitute the dataset you want to impute, as a NumPy array.

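Before loading the bundled data, download the data files and put them in the directory where the library is installed.

Linux users can download the files with wget; Windows users can copy the links below and download them directly: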

wget https://github.com/HCMY/ycimpute/raw/master/test_data/boston.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/iris.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/wine.hdf5

The files are saved to your current working directory. Move them to this location:

<your Python directory>/site-packages/ycimpute/datasets/
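For users without wget, the download step above can be sketched in Python using only the standard library. The helper names dataset_urls and download_datasets are hypothetical and not part of ycimpute; the URLs are the ones listed above. The target directory is passed in explicitly, since where your installation lives depends on your environment.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URLs taken from the wget commands above.
BASE_URL = "https://github.com/HCMY/ycimpute/raw/master/test_data/"
FILES = ["boston.hdf5", "iris.hdf5", "wine.hdf5"]


def dataset_urls():
    """Return the full download URL of each bundled data file."""
    return [BASE_URL + name for name in FILES]


def download_datasets(target_dir):
    """Download every data file into target_dir (hypothetical helper).

    target_dir should be the ycimpute/datasets/ directory inside your
    site-packages; it is created if it does not exist yet.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in dataset_urls():
        destination = target / url.rsplit("/", 1)[-1]
        urlretrieve(url, str(destination))
        saved.append(destination)
    return saved
```

One way to find the target directory, assuming the package is importable, is Path(ycimpute.__file__).parent / "datasets"; pass that path to download_datasets.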

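First, let's see how to load the bundled data.

Data is loaded through the load_data loader. Each loading function returns two arrays: the data with missing values and the corresponding complete data, in that order.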

from ycimpute.datasets import load_data
boston_missing, boston_full = load_data.load_boston()
print(boston_missing)
print(boston_full)

[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00              nan ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02              nan   7.07000000e+00 ...,              nan
    3.92830000e+02              nan]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.92830000e+02   4.03000000e+00]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
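Before imputing, it can help to see how much of the array is actually missing. The following is a minimal NumPy-only sketch; missing_summary is a hypothetical helper, not a ycimpute function, and it assumes missing entries are encoded as nan, as in the output above. A small toy array stands in for boston_missing.

```python
import numpy as np


def missing_summary(X):
    """Count nan entries overall and per column (hypothetical helper)."""
    mask = np.isnan(X)
    return {
        "total_missing": int(mask.sum()),
        "missing_rate": float(mask.mean()),
        "missing_per_column": mask.sum(axis=0).tolist(),
    }


# Toy stand-in for a dataset with missing values.
toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])
summary = missing_summary(toy)
print(summary)
```

The same call should work on boston_missing, provided it is a 2-D float array.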

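Next, choose an imputation method and fill in the missing values.

Here we take Iterforest as the example.

Instantiate the imputer class IterImput, pass any parameters you need to its constructor, and call complete() to perform the imputation.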

from ycimpute.imputer import IterImput
X = boston_missing
complete_X = IterImput().complete(X)
print(complete_X)

[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   5.46000000e+00 ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02   1.75000000e+01   7.07000000e+00 ...,   1.82000000e+01
    3.92830000e+02   6.28300000e+00]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
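A quick sanity check on any imputed array is that no nan remains and that the originally observed entries were left untouched, as appears to be the case in the output above. This is a minimal sketch with a hypothetical helper, check_imputation, which is not part of ycimpute; it assumes the imputer is expected to preserve observed values.

```python
import numpy as np


def check_imputation(X_missing, X_filled):
    """Basic consistency checks on an imputed array (hypothetical helper)."""
    X_missing = np.asarray(X_missing, dtype=float)
    X_filled = np.asarray(X_filled, dtype=float)
    if X_missing.shape != X_filled.shape:
        return {"same_shape": False, "no_nan_left": False, "observed_unchanged": False}
    observed = ~np.isnan(X_missing)
    return {
        "same_shape": True,
        "no_nan_left": not np.isnan(X_filled).any(),
        "observed_unchanged": bool(
            np.allclose(X_missing[observed], X_filled[observed])
        ),
    }


# Toy stand-ins for boston_missing and complete_X.
toy_missing = np.array([[1.0, np.nan], [3.0, 4.0]])
toy_filled = np.array([[1.0, 2.5], [3.0, 4.0]])
report = check_imputation(toy_missing, toy_filled)
print(report)
```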

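How do we judge the quality of an imputation?

Strictly speaking, imputation quality in a real-world setting cannot be assessed quickly unless you know the true values and test against them, or you look at how the completed data performs in your own model.

We take the first approach and experiment with a complete dataset and a corresponding copy that contains missing values: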

from ycimpute.utils import evaluate
from ycimpute.utils import config
from ycimpute.utils.tools import Solver
# Get the original values at the missing positions
solver = Solver()
mask_all = solver.masker(boston_missing)[config.all]
missing_index = evaluate.get_missing_index(mask_all)
original_arr = boston_full[missing_index]
# Get the imputed values at the same positions
iterforest_filled_arr = complete_X[missing_index]
# Evaluate
rmse_iterforest_score = evaluate.RMSE(original_arr, iterforest_filled_arr)
print(rmse_iterforest_score)

23.090788716
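The idea behind this evaluation can be sketched without the library: compare true and imputed values only at the positions that were missing, then take the root mean squared error. The function rmse_on_missing below is a hypothetical NumPy-only helper; it assumes missing entries are nan and is not guaranteed to match the exact definition used by evaluate.RMSE.

```python
import numpy as np


def rmse_on_missing(X_missing, X_full, X_filled):
    """RMSE between true and imputed values at the nan positions only."""
    mask = np.isnan(X_missing)
    diff = X_full[mask] - X_filled[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy stand-ins for boston_full, boston_missing and complete_X.
toy_full = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_missing = np.array([[1.0, np.nan], [np.nan, 4.0]])
toy_filled = np.array([[1.0, 3.0], [5.0, 4.0]])

score = rmse_on_missing(toy_missing, toy_full, toy_filled)
print(score)
```

Restricting the comparison to the masked positions keeps the observed entries, which every method reproduces exactly, from diluting the score.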

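To help you pick the best method, the library provides an interface that returns the score of every imputation method in one call. For now it only supports the RMSE metric, which evaluates continuous missing values.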

from ycimpute.utils.shower import show
import pandas as pd
boston_result = show.analysiser(missing_X=boston_missing, original_X=boston_full)
boston_result = pd.DataFrame.from_dict(boston_result,orient='index')
print(boston_result)

                                0
rmse_em_score           30.405419
rmse_median_score       57.616702
rmse_mean_score         52.154860
rmse_knn_score          40.944330
rmse_zero_score        159.534384
rmse_iterforest_score   23.741541
rmse_mice_score         27.914184
rmse_min_score         127.874980
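The simplest rows in this table (mean, median, zero, min) are constant fills. As a rough illustration of what such baselines do, here is a NumPy-only sketch that fills each column's nan entries with a statistic of that column's observed values. The function simple_fill is hypothetical, and filling column by column is an assumption; the library's own implementations may differ.

```python
import numpy as np


def simple_fill(X, strategy="mean"):
    """Fill nan entries column by column with a constant (hypothetical helper).

    strategy is one of "mean", "median", "min" or "zero".
    Columns that are entirely nan are not handled here.
    """
    X = np.array(X, dtype=float)  # work on a copy
    if strategy == "zero":
        fill_values = np.zeros(X.shape[1])
    else:
        reducers = {"mean": np.nanmean, "median": np.nanmedian, "min": np.nanmin}
        fill_values = reducers[strategy](X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X


toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])
print(simple_fill(toy, "mean"))
print(simple_fill(toy, "min"))
```

Baselines like these ignore relationships between columns, which is one plausible reason the model-based methods in the table score lower RMSE on this dataset.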
