Here I use the datasets that come with the library to demonstrate. In your own code, all you need to do is replace the example data with the numpy.array-format data you want to fill.
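As a minimal sketch of what numpy.array-format data can look like, the snippet below turns a small table with gaps into a float array in which missing entries are `nan`, the same marker that appears in the printed example data. The helper name `to_float_array` and the use of `None` for gaps are assumptions made for illustration; they are not part of ycimpute.

```python
import numpy as np


def to_float_array(rows):
    """Convert a list of rows to a 2-D float array, mapping None to NaN."""
    return np.array(
        [[np.nan if value is None else float(value) for value in row] for row in rows],
        dtype=float,
    )


# Toy data invented for this sketch; None marks a missing entry.
raw_rows = [
    [5.1, 3.5, None],
    [4.9, None, 1.4],
    [4.7, 3.2, 1.3],
]
X = to_float_array(raw_rows)
print(X.shape)                 # (3, 3)
print(int(np.isnan(X).sum()))  # 2
```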
Before running the examples, you need to download the data files and copy them to the datasets directory of the installed Python package.
Download the data: Linux users can get the files via "wget". Windows users can open the same HTTP addresses in a browser.
wget https://github.com/HCMY/ycimpute/raw/master/test_data/boston.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/iris.hdf5
wget https://github.com/HCMY/ycimpute/raw/master/test_data/wine.hdf5
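Where wget is not available, the same files can be fetched from Python. This is a minimal sketch using only the standard library; the helper names `local_name` and `download_all` are hypothetical, and the URLs are the three listed above. The download call itself is left commented out because it needs network access.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the wget commands above.
DATA_URLS = [
    "https://github.com/HCMY/ycimpute/raw/master/test_data/boston.hdf5",
    "https://github.com/HCMY/ycimpute/raw/master/test_data/iris.hdf5",
    "https://github.com/HCMY/ycimpute/raw/master/test_data/wine.hdf5",
]


def local_name(url):
    """Return the file name at the end of a URL path."""
    return PurePosixPath(urlparse(url).path).name


def download_all(dest_dir="."):
    """Download every file in DATA_URLS into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in DATA_URLS:
        target = dest / local_name(url)
        urlretrieve(url, str(target))
        saved.append(target)
    return saved


# Uncomment to download into the current directory:
# download_all(".")
```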
After the download completes, the data files are saved in your current directory. Then move them here:
<your python path>/site-packages/ycimpute/datasets/
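If you are not sure where site-packages lives on your machine, the directory can be located from Python. A minimal sketch using only the standard library; `package_subdir` is a hypothetical helper, and the `datasets` sub-directory name is taken from the path above.

```python
import importlib.util
from pathlib import Path


def package_subdir(package, subdir):
    """Return the path of `subdir` inside an installed top-level package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    # spec.origin points at the package's __init__.py file.
    return Path(spec.origin).parent / subdir


try:
    print(package_subdir("ycimpute", "datasets"))
except ModuleNotFoundError as err:
    print(err)
```

The downloaded .hdf5 files can then be moved into the printed directory with your file manager or with `shutil.move`.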
First of all, let's take a look at how to load the example data:
Load the data using the loader load_data
Each dataset-specific load function returns two copies of the data: one contains missing values, the other is the corresponding complete data.
from ycimpute.datasets import load_data
boston_missing, boston_full = load_data.load_boston()
print(boston_missing)
print(boston_full)
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 nan ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 nan 7.07000000e+00 ..., nan
3.92830000e+02 nan]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 7.07000000e+00 ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 0.00000000e+00 7.07000000e+00 ..., 1.78000000e+01
3.92830000e+02 4.03000000e+00]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
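To build a similar missing/complete pair from your own complete array, one option is to blank out random entries. This is a sketch under assumptions: the helper `make_missing`, the missing rate, and the toy array are all invented for illustration, and it is not necessarily how the bundled files were produced.

```python
import numpy as np


def make_missing(full, missing_rate=0.1, seed=0):
    """Return a copy of `full` with random entries set to NaN, plus the mask used."""
    rng = np.random.default_rng(seed)
    missing = np.array(full, dtype=float)  # np.array copies, so `full` is untouched
    mask = rng.random(missing.shape) < missing_rate
    missing[mask] = np.nan
    return missing, mask


# Toy complete data invented for this sketch.
full = np.arange(20, dtype=float).reshape(5, 4)
missing, mask = make_missing(full, missing_rate=0.3)
print(missing)
```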
Now take a look at how to select a method to fill in the missing values:
Here I use IterForest as an example.
Use the fill class IterImput: pass the required parameters to its constructor, then call complete() to carry out the fill process.
from ycimpute.imputer import IterImput
X = boston_missing
complete_X = IterImput().complete(X)
print(complete_X)
[[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 ..., 1.53000000e+01
3.96900000e+02 4.98000000e+00]
[ 2.73100000e-02 0.00000000e+00 5.46000000e+00 ..., 1.78000000e+01
3.96900000e+02 9.14000000e+00]
[ 2.72900000e-02 1.75000000e+01 7.07000000e+00 ..., 1.82000000e+01
3.92830000e+02 6.28300000e+00]
...,
[ 6.07600000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 5.64000000e+00]
[ 1.09590000e-01 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.93450000e+02 6.48000000e+00]
[ 4.74100000e-02 0.00000000e+00 1.19300000e+01 ..., 2.10000000e+01
3.96900000e+02 7.88000000e+00]]
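To make the construct-then-`complete()` pattern concrete without depending on the library, here is a much simpler stand-in that fills each gap with its column mean. `MeanFill` is a hypothetical class written for this sketch; it is not IterForest and not part of ycimpute.

```python
import numpy as np


class MeanFill:
    """Hypothetical stand-in: fill each NaN with the mean of its column."""

    def complete(self, X):
        X = np.array(X, dtype=float)        # work on a copy
        col_means = np.nanmean(X, axis=0)   # column means, ignoring NaN
        rows, cols = np.where(np.isnan(X))  # positions of the gaps
        X[rows, cols] = col_means[cols]
        return X


# Toy data invented for this sketch.
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])
filled_toy = MeanFill().complete(X_toy)
print(filled_toy)
# [[1. 2.]
#  [2. 4.]
#  [3. 3.]]
```

A column that is entirely NaN has no mean to fall back on, so it would stay NaN with this approach.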
How do you evaluate whether the filling result is good or bad?
Strictly speaking, the quality of the fill in a real-world setting cannot be evaluated quickly, unless you know the true values and test against them, or look at how your model performs on the filled data.
We use the former method here, experimenting with a complete set of data and the corresponding data with missing values:
from ycimpute.utils import evaluate
from ycimpute.utils import config
from ycimpute.utils.tools import Solver
# get the original values at the missing positions
solver = Solver()
mask_all = solver.masker(boston_missing)[config.all]
missing_index = evaluate.get_missing_index(mask_all)
original_arr = boston_full[missing_index]
# get the filled values at the same positions
iterforest_filled_arr = complete_X[missing_index]
# evaluate: RMSE between the original and the filled values
rmse_iterforest_score = evaluate.RMSE(original_arr, iterforest_filled_arr)
print(rmse_iterforest_score)
23.090788716
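The idea behind this evaluation can be reproduced with plain NumPy: compare the true and the filled values only at the positions that were missing. The function `rmse_on_missing` and the toy arrays are assumptions for illustration; this is not the implementation of `evaluate.RMSE`.

```python
import numpy as np


def rmse_on_missing(original, missing, filled):
    """RMSE between original and filled values, restricted to the NaN positions of `missing`."""
    original = np.asarray(original, dtype=float)
    filled = np.asarray(filled, dtype=float)
    mask = np.isnan(np.asarray(missing, dtype=float))
    diff = original[mask] - filled[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data invented for this sketch.
original = np.array([[1.0, 2.0], [3.0, 4.0]])
missing = np.array([[1.0, np.nan], [np.nan, 4.0]])
filled = np.array([[1.0, 5.0], [7.0, 4.0]])

score = rmse_on_missing(original, missing, filled)
print(score)  # sqrt((3**2 + 4**2) / 2) = sqrt(12.5)
```

A lower score means the filled values are closer to the true ones.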
To help you choose a better approach, there is a single interface that runs all of the fill methods and reports their scores. It currently supports the RMSE evaluation function for continuous data.
from ycimpute.utils.shower import show
import pandas as pd
boston_result = show.analysiser(missing_X=boston_missing, original_X=boston_full)
boston_result = pd.DataFrame.from_dict(boston_result,orient='index')
print(boston_result)
0
rmse_em_score 30.405419
rmse_median_score 57.616702
rmse_mean_score 52.154860
rmse_knn_score 40.944330
rmse_zero_score 159.534384
rmse_iterforest_score 23.741541
rmse_mice_score 27.914184
rmse_min_score 127.874980
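Since a lower RMSE is better, the methods can be ranked by sorting the scores in ascending order. The sketch below copies the scores printed above into a plain dict; the variable names are chosen for illustration.

```python
# Scores copied from the table printed above.
scores = {
    "rmse_em_score": 30.405419,
    "rmse_median_score": 57.616702,
    "rmse_mean_score": 52.154860,
    "rmse_knn_score": 40.944330,
    "rmse_zero_score": 159.534384,
    "rmse_iterforest_score": 23.741541,
    "rmse_mice_score": 27.914184,
    "rmse_min_score": 127.874980,
}

# Sort from best (lowest RMSE) to worst (highest RMSE).
ranked = sorted(scores.items(), key=lambda item: item[1])
best_name, best_score = ranked[0]

for name, value in ranked:
    print(f"{name:<24}{value:>12.6f}")
print("best:", best_name)  # best: rmse_iterforest_score
```

On this run IterForest has the lowest error, followed by MICE and EM, while the zero and min fills are the worst.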