To begin, we define a plotting helper. It takes a vector of covariate values and a vector of group/batch labels, and draws a rug plot along with overlaid histograms of the covariate values for each group/batch.
library(ggplot2)
library(magrittr)  # provides the %>% pipe

# Rug plot plus overlaid histograms of the covariate, colored by group/batch.
plot.covars <- function(Xs, Ts, title="", xlabel="Covariate",
                        ylabel="Density") {
  data.frame(Batch=factor(Ts, levels=c(0, 1)), Covariate=Xs) %>%
    ggplot(aes(x=Covariate, group=Batch, color=Batch)) +
    geom_rug() +
    # position="identity" overlays the two histograms instead of stacking them
    geom_histogram(aes(fill=Batch), binwidth=0.1, position="identity",
                   alpha=0.5) +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits=c(-1, 1)) +
    # use the same colors for outlines and fills
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
                       name="Group/Batch") +
    scale_fill_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
                      name="Group/Batch") +
    theme_bw()
}
Next, we generate some simulated data in which the covariates are imbalanced across the groups/batches, and plot the covariate values for the simulated data along with their histograms:
sim.low <- cb.sims.sim_linear(n=n, unbalancedness=2)
plot.covars(sim.low$Xs, sim.low$Ts, title="Sample covariate values")
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
Note in particular that there are many samples in group/batch \(0\) with covariate values much smaller than the smallest attained by samples in group/batch \(1\), and there are many samples in group/batch \(1\) with covariate values much larger than the largest attained by samples in group/batch \(0\).
Conceptually, vector matching can be thought of as a form of “propensity trimming”; that is, it will remove samples from a given group/batch which are dissimilar from one (or more) other groups/batches on the basis of their propensity scores. This is a relatively coarse approach to balancing covariates across the groups/batches:
vm.retained <- cb.align.vm_trim(sim.low$Ts, sim.low$Xs)
plot.covars(sim.low$Xs[vm.retained], sim.low$Ts[vm.retained],
title="Sample covariate values (after VM)")
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
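The propensity trimming idea described above can be sketched in base R for the two-group case. This is only an illustration under stated assumptions, not the implementation used by `cb.align.vm_trim`: the toy data, the logistic propensity model, and all names (`Ts.toy`, `Xs.toy`, `ps`, `lo`, `hi`, `retained.toy`) are hypothetical.

```r
set.seed(1)
n.toy <- 200

# Toy two-group data with imbalanced covariates on (-1, 1) (hypothetical).
Ts.toy <- rbinom(n.toy, 1, 0.5)
Xs.toy <- ifelse(Ts.toy == 1, rbeta(n.toy, 4, 2), rbeta(n.toy, 2, 4)) * 2 - 1

# Estimate propensity scores with a logistic regression (an assumed model).
ps <- fitted(glm(Ts.toy ~ Xs.toy, family = binomial()))

# Common support: from the largest per-group minimum to the smallest
# per-group maximum of the estimated propensity scores.
lo <- max(tapply(ps, Ts.toy, min))
hi <- min(tapply(ps, Ts.toy, max))

# Keep only the samples whose propensity score lies in the common support.
retained.toy <- which(ps >= lo & ps <= hi)
length(retained.toy)
```

Samples outside `[lo, hi]` have propensity scores that no sample in the other group attains, which is why trimming them removes the non-overlapping tails of the covariate distributions.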
Note that the ranges of covariate values attained by the two groups now overlap; that is, neither group/batch still contains covariate values that are larger/smaller than the largest/smallest attained by the other group/batch.
Conceptually, \(K\)-way matching can be thought of as a way to directly include/exclude samples from across the groups/batches until the covariate distributions per group/batch are rendered approximately equal. This is a relatively restrictive approach to aligning covariates across the groups/batches:
kway.retained <- cb.align.kway_match(sim.low$Ts, data.frame(Covar=sim.low$Xs),
match.form="Covar")$Retained.Ids
plot.covars(sim.low$Xs[kway.retained], sim.low$Ts[kway.retained],
title="Sample covariate values (after K-way matching)")
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
In this case, we can see that the empirical covariate values retained after \(K\)-way matching are almost identical across the two groups.
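One way to put a number on "almost identical" is the standardized mean difference (SMD) between the two groups. The helper below is a minimal sketch and is not part of the package; `smd` and the toy vectors are hypothetical names introduced here for illustration.

```r
# Standardized mean difference of a covariate between groups 0 and 1.
smd <- function(Xs, Ts) {
  x0 <- Xs[Ts == 0]
  x1 <- Xs[Ts == 1]
  pooled.sd <- sqrt((var(x0) + var(x1)) / 2)
  (mean(x1) - mean(x0)) / pooled.sd
}

set.seed(1)
Ts.toy <- rep(c(0, 1), each = 100)
# Imbalanced toy covariates: the group means differ.
Xs.shifted <- c(rnorm(100, -0.3, 0.2), rnorm(100, 0.3, 0.2))
# Balanced toy covariates: both groups share one distribution.
Xs.same <- c(rnorm(100, 0, 0.2), rnorm(100, 0, 0.2))

smd(Xs.shifted, Ts.toy)  # large in magnitude
smd(Xs.same, Ts.toy)     # close to zero
```

Assuming the objects from this section are available, `smd(sim.low$Xs[kway.retained], sim.low$Ts[kway.retained])` would give the corresponding value after \(K\)-way matching.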
Vector matching typically retains more samples for subsequent analysis than \(K\)-way matching. This may be undesirable if subsequent inference/estimation techniques are known to be sensitive to unequal empirical covariate distributions, since vector matching only ensures that the covariate ranges overlap rather than that the distributions are approximately equal.
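To compare how many samples each approach keeps, a small helper can tabulate the retained samples per group/batch. This is a sketch only; `retained.summary` and the toy inputs are hypothetical and not part of the package.

```r
# Total and per-group counts of the samples selected by `retained`
# (an index or logical vector into Ts).
retained.summary <- function(Ts, retained) {
  Ts.kept <- Ts[retained]
  counts <- table(Ts.kept)
  c(total = length(Ts.kept), setNames(as.integer(counts), names(counts)))
}

# Toy example: six samples, five of which are retained.
Ts.toy <- c(0, 0, 0, 1, 1, 1)
retained.summary(Ts.toy, c(1, 2, 4, 5, 6))
```

Assuming the objects from this section are available, calling `retained.summary(sim.low$Ts, vm.retained)` and `retained.summary(sim.low$Ts, kway.retained)` would show the two retention counts side by side.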