For this example, we will use the UnsupervisedTreatment, but the same parameters can be used with the other treatment plans as well.
Here we create a simple data set where the inputs have missing values.
library(vtreat)## Loading required package: wrapr
d = data.frame(
"x" = c(0, 1, 1000, NA),
"w" = c(3, 6, NA, 100),
"y" = c(0, 0, 1, 1)
)
knitr::kable(d)| x | w | y |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 6 | 0 |
| 1000 | NA | 1 |
| NA | 100 | 1 |
Some of the summary statistics of d. We're primarily interested in the inputs x and w.
summary(d)## x w y
## Min. : 0.0 Min. : 3.00 Min. :0.0
## 1st Qu.: 0.5 1st Qu.: 4.50 1st Qu.:0.0
## Median : 1.0 Median : 6.00 Median :0.5
## Mean : 333.7 Mean : 36.33 Mean :0.5
## 3rd Qu.: 500.5 3rd Qu.: 53.00 3rd Qu.:1.0
## Max. :1000.0 Max. :100.00 Max. :1.0
## NA's :1 NA's :1
By default, vtreat fills in missing values with the mean value of the column, and adds an advisory *_is_bad column to mark the location of the original missing values.
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE)
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0.0000 | 0 | 3.00000 | 0 | 0 |
| 1.0000 | 0 | 6.00000 | 0 | 0 |
| 1000.0000 | 0 | 36.33333 | 1 | 1 |
| 333.6667 | 1 | 100.00000 | 0 | 1 |
If you do not want to use the mean to fill in missing values, you can change the imputation function using the parameter missingness_imputation. Here, we fill in missing values with the median.
median2 <- function(x, wts) {
median(x)
}
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE,
missingness_imputation = median2)
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | 6 | 1 | 1 |
| 1 | 1 | 100 | 0 | 1 |
You can also use a constant value instead of a function. Here we replace missing values with the value -1.
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE,
missingness_imputation = -1)
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | -1 | 1 | 1 |
| -1 | 1 | 100 | 0 | 1 |
You can control the imputation strategy per column via the map imputation_map. Any column not named in the imputation map will use the imputation strategy specified by the missingness_imputation parameter (which is the mean by default).
Here we use the maximum value to fill in the missing values for x and the value 0 to fill in the missing values for w.
max2 <- function(x, wts) {
max(x)
}
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE,
imputation_map = list(
x = max2,
w = 0
))
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | 0 | 1 | 1 |
| 1000 | 1 | 100 | 0 | 1 |
If we don't specify a column, vtreat looks atmissingness_imputation (in this case, -1).
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE,
missingness_imputation = -1,
imputation_map = list(
x = max2
))
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | -1 | 1 | 1 |
| 1000 | 1 | 100 | 0 | 1 |
If missingness_imputation is not specified, vtreat uses a weighted mean.
treatments <- designTreatmentsZ(d,
varlist = c('x', 'w'),
verbose = FALSE,
imputation_map = list(
x = max2
))
d_treated <- prepare(treatments,
d)
d_treated$y <- d$y
knitr::kable(d_treated)| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3.00000 | 0 | 0 |
| 1 | 0 | 6.00000 | 0 | 0 |
| 1000 | 0 | 36.33333 | 1 | 1 |
| 1000 | 1 | 100.00000 | 0 | 1 |