Here I use the datasets that ship with the library to demonstrate. In your own code, all you need to do is replace the sample data with the numpy.array-format data you want to fill.

Before running the examples, you need to download the data files and copy them into the datasets directory of the installed ycimpute package.

Download the data. Linux users can fetch the files with "wget"; Windows users can open the same HTTP addresses in a browser.

wget https://github.com/HCMY/ycimpute/raw/master/test_data/boston.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/iris.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/wine.hdf5

Once the download has completed, the files are saved in your current directory. Then move them here:

<your python path>/site-packages/ycimpute/datasets/
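For readers without wget, the same three files could be fetched from Python. The following is a minimal sketch using only the standard library; the helper names dataset_urls and download_datasets are hypothetical and are not part of ycimpute.

```python
import os
import urllib.request

# Base address and file names taken from the wget commands above.
BASE_URL = "https://github.com/HCMY/ycimpute/raw/master/test_data/"
FILES = ["boston.hdf5", "iris.hdf5", "wine.hdf5"]


def dataset_urls():
    """Return the full download address of each sample data file."""
    return [BASE_URL + name for name in FILES]


def download_datasets(target_dir):
    """Download every sample file into target_dir and return the saved paths.

    Hypothetical helper: needs network access and write permission
    to target_dir.
    """
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in dataset_urls():
        path = os.path.join(target_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved
```

Calling download_datasets with the datasets directory shown above as target_dir would save the files directly in place. If ycimpute is already installed, that directory can usually be located with os.path.join(os.path.dirname(ycimpute.__file__), "datasets").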

First of all, let's take a look at how to load the data:

Load the data with the loader load_data. Each dataset-specific load function returns two copies of the data: one with missing values, and the corresponding complete data.

from ycimpute.datasets import load_data

boston_missing, boston_full = load_data.load_boston()
print(boston_missing)
print(boston_full)

[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00              nan ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02              nan   7.07000000e+00 ...,              nan
    3.92830000e+02              nan]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
    3.92830000e+02   4.03000000e+00]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
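To work with your own data in the same way, you need a numpy.array in which missing entries are NaN. If you start from a complete array, a pair like the one the loader returns can be produced by masking entries at random. The sketch below shows one way to do this; make_missing_copy is a hypothetical helper, not a ycimpute function, and the 20% missing rate is an arbitrary choice.

```python
import numpy as np


def make_missing_copy(full, fraction=0.1, seed=0):
    """Return a copy of `full` with a given fraction of entries set to NaN.

    Hypothetical helper for illustration; entries are chosen uniformly
    at random without replacement.
    """
    full = np.asarray(full, dtype=float)
    rng = np.random.default_rng(seed)
    n_missing = int(round(fraction * full.size))
    flat_idx = rng.choice(full.size, size=n_missing, replace=False)
    missing = full.copy()          # leave the complete data untouched
    missing.flat[flat_idx] = np.nan
    return missing


# A small complete array standing in for your own data.
own_full = np.arange(20, dtype=float).reshape(5, 4)
own_missing = make_missing_copy(own_full, fraction=0.2, seed=0)

n_missing = int(np.isnan(own_missing).sum())
print(n_missing)  # 4 of the 20 entries are now NaN
```

The complete copy is kept so that the filled values can later be compared against the true ones.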

Next, let's look at how to choose a method to fill in the missing values:

Here I use IterForest as an example.

Create an instance of the fill class IterImput, passing any required parameters to its constructor, then call complete() to carry out the fill.

from ycimpute.imputer import IterImput

X = boston_missing
complete_X = IterImput().complete(X)
print(complete_X)

[[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
    3.96900000e+02   4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   5.46000000e+00 ...,   1.78000000e+01
    3.96900000e+02   9.14000000e+00]
 [  2.72900000e-02   1.75000000e+01   7.07000000e+00 ...,   1.82000000e+01
    3.92830000e+02   6.28300000e+00]
 ..., 
 [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   5.64000000e+00]
 [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.93450000e+02   6.48000000e+00]
 [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
    3.96900000e+02   7.88000000e+00]]
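The same construct-then-complete() pattern can be illustrated without the library. The class below is a hypothetical stand-in that fills each missing entry with its column mean; it is a much simpler technique than IterForest and is not how ycimpute implements any of its imputers.

```python
import numpy as np


class ColumnMeanImputer:
    """Hypothetical stand-in imputer: fills NaN with the column mean."""

    def complete(self, X):
        X = np.array(X, dtype=float)            # work on a copy
        col_means = np.nanmean(X, axis=0)       # means over observed values
        rows, cols = np.where(np.isnan(X))      # positions of missing entries
        X[rows, cols] = col_means[cols]
        return X


X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0]])
filled_demo = ColumnMeanImputer().complete(X_demo)
print(filled_demo)
```

The input array is left unchanged and a filled copy is returned. A column with no observed values at all would stay NaN under this sketch.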

How do you evaluate whether the filling result is good or bad?

Strictly speaking, the quality of a fill in a real-world setting cannot be evaluated directly. You can either test against known true values, or look at how your model performs on the filled data.

We take the former approach, using a complete dataset and its corresponding copy with missing values:

from ycimpute.utils import config
from ycimpute.utils import evaluate
from ycimpute.utils.tools import Solver

# Locate the missing entries and take their true values from the complete data
solver = Solver()
mask_all = solver.masker(boston_missing)[config.all]
missing_index = evaluate.get_missing_index(mask_all)
original_arr = boston_full[missing_index]
# Take the filled values at the same positions
iterforest_filled_arr = complete_X[missing_index]
# Compare the true values with the filled values
rmse_iterforest_score = evaluate.RMSE(original_arr, iterforest_filled_arr)
print(rmse_iterforest_score)

23.090788716
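The idea behind this evaluation can be sketched with NumPy alone: compare the true and filled values only at the positions that were missing. The function below uses the standard root-mean-square-error definition; whether evaluate.RMSE computes exactly the same quantity is an assumption, and rmse_on_missing is a hypothetical name.

```python
import numpy as np


def rmse_on_missing(full, missing, filled):
    """RMSE between true and filled values at the originally missing positions."""
    full = np.asarray(full, dtype=float)
    missing = np.asarray(missing, dtype=float)
    filled = np.asarray(filled, dtype=float)
    mask = np.isnan(missing)                    # True where a value was missing
    if not mask.any():
        raise ValueError("the data contains no missing entries to evaluate")
    diff = full[mask] - filled[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


full_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
missing_demo = np.array([[1.0, np.nan], [np.nan, 4.0]])
filled_demo = np.array([[1.0, 5.0], [7.0, 4.0]])

score = rmse_on_missing(full_demo, missing_demo, filled_demo)
print(score)  # errors of -3 and -4 give sqrt(12.5)
```

Restricting the comparison to the missing positions matters: the observed entries are identical in both arrays and would otherwise pull the score toward zero.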

To help you choose the best approach, there is a single interface that evaluates all of the fill methods at once. It currently supports the RMSE evaluation function for continuous data.

from ycimpute.utils.shower import show

import pandas as pd
boston_result = show.analysiser(missing_X=boston_missing, original_X=boston_full)
boston_result = pd.DataFrame.from_dict(boston_result,orient='index')
print(boston_result)

                                0
rmse_em_score           30.405419
rmse_median_score       57.616702
rmse_mean_score         52.154860
rmse_knn_score          40.944330
rmse_zero_score        159.534384
rmse_iterforest_score   23.741541
rmse_mice_score         27.914184
rmse_min_score         127.874980
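A lower RMSE means the filled values are closer to the true ones. The idea of scoring several methods in one pass can be sketched for the simplest strategies in the table (zero, mean, median, min) with NumPy. This is an illustration under the standard RMSE definition, not the implementation of show.analysiser, and all names in it are hypothetical.

```python
import numpy as np

# Each strategy maps the incomplete array to one fill value per column.
STRATEGIES = {
    "zero": lambda X: np.zeros(X.shape[1]),
    "mean": lambda X: np.nanmean(X, axis=0),
    "median": lambda X: np.nanmedian(X, axis=0),
    "min": lambda X: np.nanmin(X, axis=0),
}


def fill_with(missing, statistic):
    """Fill NaN entries with the per-column value given by `statistic`."""
    X = np.array(missing, dtype=float)          # work on a copy
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = statistic(X)[cols]
    return X


def compare_strategies(full, missing):
    """Return {strategy: RMSE on missing positions}, best (lowest) first."""
    full = np.asarray(full, dtype=float)
    mask = np.isnan(np.asarray(missing, dtype=float))
    scores = {}
    for name, statistic in STRATEGIES.items():
        filled = fill_with(missing, statistic)
        diff = full[mask] - filled[mask]
        scores[name] = float(np.sqrt(np.mean(diff ** 2)))
    return dict(sorted(scores.items(), key=lambda item: item[1]))


full_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
missing_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [4.0, 40.0]])

scores = compare_strategies(full_demo, missing_demo)
for name, value in scores.items():
    print(name, round(value, 3))
```

On this toy data the mean fill scores best and the zero fill worst, which matches the pattern in the table above where zero and min filling give the largest errors.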

