To start, we will begin with a simulation example, similar to the ones we were working in for the simulations, which you can access from:
Let’s regenerate our working example data with some plotting code:
# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="",
xlabel="Covariate",
ylabel="Outcome (1st dimension)") {
data = data.frame(Y1=Ys[,1], Y2=Ys[,2],
Group=factor(Ts, levels=c(0, 1), ordered=TRUE),
Covariates=Xs)
data %>%
ggplot(aes(x=Covariates, y=Y1, color=Group)) +
geom_point() +
labs(title=title, x=xlabel, y=ylabel) +
scale_x_continuous(limits = c(-1, 1)) +
scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
theme_bw()
}
Next, we will generate a simulation:
sim.simpl = cb.sims.sim_sigmoid(n=n, eff_sz=1, unbalancedness=1.5)
plot.sim(sim.simpl$Ys, sim.simpl$Ts, sim.simpl$Xs, title="Sigmoidal Simulation")
It looks like the red group/batch tends to have outcomes exceeding the blue group/batch. We can test whether a group/batch effect can be detected. See the vignette on causal conditional distance correlation for more details, found at:
Now to run the test and evaluate with \(\alpha = 0.05\):
result <- cb.detect.caus_cdcorr(sim.simpl$Ys, sim.simpl$Ts, sim.simpl$Xs, R=100)
print(sprintf("p-value: %.4f", result$Test$p.value))
#> [1] "p-value: 0.0099"
Since the \(p\)-value is \(< \alpha\), this indicates that the data provide evidence to reject the null hypothesis in favor of the alternative (there is a causal group/batch effect). Next, we can correct for this effect using causal conditional ComBat. Note in particular that the covariates should be passed as a named data frame:
cor.sim.simpl <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts,
data.frame(Covar=sim.simpl$Xs),
match.form="Covar")
#> Adjusting for1covariate(s) or covariate level(s)
We can plot the augmented data after matching conditional ComBat has been applied, like so:
plot.sim(cor.sim.simpl$Ys.corrected, cor.sim.simpl$Ts, cor.sim.simpl$Xs$Covar,
title="Sigmoidal Simulation (matching cComBat corrected)")
We can see that the augmented data nearly perfectly overlaps across the two groups/batches, which can be confirmed using the causal conditional distance correlation:
result.cor <- cb.detect.caus_cdcorr(cor.sim.simpl$Ys.corrected, cor.sim.simpl$Ts,
cor.sim.simpl$Xs$Covar, R=100)
print(sprintf("p-value: %.4f", result.cor$Test$p.value))
#> [1] "p-value: 0.9109"
Since the \(p\)-value is \(> \alpha\), the data provide no evidence to reject the null hypothesis (that there is a causal group/batch effect) in favor of the alternative. Stated another way, the group/batch effect that was present before matching conditional ComBat correction is no longer detectable.
If we have multiple covariate values, you can specify custom formulas for the matching process using the standard R formula notation. For instance, if we had a second covariate value:
You could specify a formula instead like this:
cor.sim <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, Xs.2covar,
match.form="Covar1 + Covar2")
#> Adjusting for2covariate(s) or covariate level(s)
You can also use the standard MatchIt::matchit()
arguments to leverage custom matching algorithms. These are passed into
the function using the match.args
argument. For instance,
consider if we have a third categorical covariate:
If we wanted to leverage exact matching for the categorical
covariate, we could specify to perform exact matching for
Cat.Covar
, and leave nearest-neighbor matching (without
replacement) for Covar1
and Covar2
, exactly as
you would for the MatchIt::matchit()
package:
match.args <- list(method="nearest", exact="Cat.Covar", replace=FALSE,
caliper=0.1)
cor.sim <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, Xs.3covar,
match.form="Covar1 + Covar2 + Cat.Covar",
match.args=match.args)
#> Warning: Fewer control units than treated units in some `exact` strata; not all
#> treated units will get a match.
#> Adjusting for3covariate(s) or covariate level(s)
Intuitively, the causal pre-processing procedures (including propensity trimming and matching) provide a form of bias mitigation from certain covariates, similar to biases due to model misspecifications, when these covariates are asymmetrically distributed across batches. These steps can be visualized by comparing the covariate distributions before and after batch effect correction; e.g., via a histogram:
# a function for plotting a histogram plot of the covariates
plot.covars <- function(Ts, Xs, title="",
xlabel="Covariate",
ylabel="Number of Samples") {
data = data.frame(Covariates=as.numeric(Xs),
Group=factor(Ts, levels=c(0, 1), ordered=TRUE))
data %>%
ggplot(aes(x=Covariates, color=Group, fill=Group)) +
geom_histogram(position="identity", alpha=0.5) +
labs(title=title, x=xlabel, y=ylabel) +
scale_x_continuous(limits = c(-1, 1)) +
scale_y_continuous(limits=c(0, 12)) +
scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
scale_fill_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
theme_bw()
}
ggarrange(plot.covars(sim.simpl$Ts, sim.simpl$Xs, title="(A) Unfiltered samples"),
plot.covars(cor.sim.simpl$Ts, cor.sim.simpl$Xs$Covar,
title="(B) Matched + Trimmed samples"),
nrow=2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
Note that the number of samples retained (the sum of the counts) is much lower after matching and trimming.
Batch effect corrections learned via
cb.correct.matching_cComBat()
are reasonably interpretable
for samples within a range of covariate overlap across the batches
included in the correction, which is more general than a fully matching
subset (used for the estimation procedure). This means that it may be
reasonable to apply learned corrections to certain out-of-sample data
which have similar covariates to the included samples, even if these
samples are not part of the fully matched subset used directly for
estimation of the batch effect correction model. To do so, we can use
the apply.oos
argument, which has the effect of learning
batch effects from a fully matched subset of points, and then applying
the learned batch effects to a less stringent covariate-overlapping
subset of samples. This therefore has the effect of learning the batch
effect correction from a subset of samples chosen to reduce bias, and
then applies it to a wider subset of samples which are similar to the
samples used for bias reduction.
cor.sim.oos <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, data.frame(Covar=sim.simpl$Xs),
match.form="Covar", apply.oos=TRUE)
#> Adjusting for1covariate(s) or covariate level(s)
ggarrange(plot.covars(cor.sim.oos$Ts, cor.sim.oos$Xs$Covar,
title="(A) In- and out-of-sample data"),
plot.sim(cor.sim.oos$Ys.corrected, cor.sim.oos$Ts, cor.sim.oos$Xs$Covar,
title="(B) matching cComBat on in- and out-of-sample data"),
nrow=2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
This applies matching cComBat to the in-sample points (the fully matching subset across all batches, all of whom together are chosen such that the batches are covariate balanced, or the covariate distributions are rendered approximately equal, across batches) as well as out-of-sample points (data from individual batches who fall within a range of covariate overlap across all batches).
This behavior can be tuned directly via the
cb.correct.apply_cComBat()
function. We caution individuals
from using this function directly without performing steps to ensure
that the subset of samples the batch effect correction is learned from
and the subset of samples to whom the batch effect correction are
applied have similar covariate ranges. The code below computes the
points within a range of covariate overlap across all batches using
vertex matching.
oos.ids <- cb.align.vm_trim(sim.simpl$Ts, sim.simpl$Xs)
Ys.oos <- sim.simpl$Ys[oos.ids,,drop=FALSE]; Ts.oos <- sim.simpl$Ts[oos.ids]
Xs.oos <- sim.simpl$Xs[oos.ids,,drop=FALSE]
Ys.oos.cor <- cb.correct.apply_cComBat(Ys.oos, Ts.oos, data.frame(Covar=Xs.oos),
cor.sim.oos$Model)
plot.sim(Ys.oos.cor, Ts.oos, Xs.oos, title=" matching cComBat applied to OOS data")