Matching Conditional ComBat

Eric W. Bridgeford

2025-01-07

require(causalBatch)
require(ggplot2)
require(ggpubr)
require(tidyr)
n = 200

To start, we will begin with a simulation example, similar to the ones we were working in for the simulations, which you can access from:

vignette("cb.simulations", package="causalBatch")

Let’s regenerate our working example data with some plotting code:

# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="", 
                     xlabel="Covariate",
                     ylabel="Outcome (1st dimension)") {
  data = data.frame(Y1=Ys[,1], Y2=Ys[,2], 
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE), 
                    Covariates=Xs)
  
  data %>%
    ggplot(aes(x=Covariates, y=Y1, color=Group)) +
    geom_point() +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}

Next, we will generate a simulation:

sim.simpl = cb.sims.sim_sigmoid(n=n, eff_sz=1, unbalancedness=1.5)

plot.sim(sim.simpl$Ys, sim.simpl$Ts, sim.simpl$Xs, title="Sigmoidal Simulation")

It looks like the red group/batch tends to have outcomes exceeding the blue group/batch. We can test whether a group/batch effect can be detected. See the vignette on causal conditional distance correlation for more details, found at:

vignette("cb.detect.caus_cdcorr", package="causalBatch")

Now to run the test and evaluate with \(\alpha = 0.05\):

result <- cb.detect.caus_cdcorr(sim.simpl$Ys, sim.simpl$Ts, sim.simpl$Xs, R=100)
print(sprintf("p-value: %.4f", result$Test$p.value))
#> [1] "p-value: 0.0099"

Since the \(p\)-value is \(< \alpha\), this indicates that the data provide evidence to reject the null hypothesis in favor of the alternative (there is a causal group/batch effect). Next, we can correct for this effect using causal conditional ComBat. Note in particular that the covariates should be passed as a named data frame:

cor.sim.simpl <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, 
                                         data.frame(Covar=sim.simpl$Xs),
                                         match.form="Covar")
#> Adjusting for1covariate(s) or covariate level(s)

We can plot the augmented data after matching conditional ComBat has been applied, like so:

plot.sim(cor.sim.simpl$Ys.corrected, cor.sim.simpl$Ts, cor.sim.simpl$Xs$Covar,
         title="Sigmoidal Simulation (matching cComBat corrected)")

We can see that the augmented data nearly perfectly overlaps across the two groups/batches, which can be confirmed using the causal conditional distance correlation:

result.cor <- cb.detect.caus_cdcorr(cor.sim.simpl$Ys.corrected, cor.sim.simpl$Ts,
                                    cor.sim.simpl$Xs$Covar, R=100)
print(sprintf("p-value: %.4f", result.cor$Test$p.value))
#> [1] "p-value: 0.9109"

Since the \(p\)-value is \(> \alpha\), the data provide no evidence to reject the null hypothesis (that there is a causal group/batch effect) in favor of the alternative. Stated another way, the group/batch effect that was present before matching conditional ComBat correction is no longer detectable.

If we have multiple covariate values, you can specify custom formulas for the matching process using the standard R formula notation. For instance, if we had a second covariate value:

Xs.2covar <- data.frame(Covar1=sim.simpl$Xs, Covar2=runif(n))

You could specify a formula instead like this:

cor.sim <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, Xs.2covar, 
                                   match.form="Covar1 + Covar2")
#> Adjusting for2covariate(s) or covariate level(s)

You can also use the standard MatchIt::matchit() arguments to leverage custom matching algorithms. These are passed into the function using the match.args argument. For instance, consider if we have a third categorical covariate:

Xs.3covar <- cbind(data.frame(Cat.Covar=factor(rbinom(n, size=1, 0.5))), 
                   Xs.2covar)

If we wanted to leverage exact matching for the categorical covariate, we could specify to perform exact matching for Cat.Covar, and leave nearest-neighbor matching (without replacement) for Covar1 and Covar2, exactly as you would for the MatchIt::matchit() package:

match.args <- list(method="nearest", exact="Cat.Covar", replace=FALSE, 
                   caliper=0.1)
cor.sim <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, Xs.3covar, 
                                   match.form="Covar1 + Covar2 + Cat.Covar",
                                   match.args=match.args)
#> Warning: Fewer control units than treated units in some `exact` strata; not all
#> treated units will get a match.
#> Adjusting for3covariate(s) or covariate level(s)

Applying Matching cComBat to out-of-sample data

Intuitively, the causal pre-processing procedures (including propensity trimming and matching) provide a form of bias mitigation from certain covariates, similar to biases due to model misspecifications, when these covariates are asymmetrically distributed across batches. These steps can be visualized by comparing the covariate distributions before and after batch effect correction; e.g., via a histogram:

# a function for plotting a histogram plot of the covariates
plot.covars <- function(Ts, Xs, title="", 
                     xlabel="Covariate",
                     ylabel="Number of Samples") {
  data = data.frame(Covariates=as.numeric(Xs),
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE))
  
  data %>%
    ggplot(aes(x=Covariates, color=Group, fill=Group)) +
    geom_histogram(position="identity", alpha=0.5) +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_y_continuous(limits=c(0, 12)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    scale_fill_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}

ggarrange(plot.covars(sim.simpl$Ts, sim.simpl$Xs, title="(A) Unfiltered samples"),
          plot.covars(cor.sim.simpl$Ts, cor.sim.simpl$Xs$Covar,
                      title="(B) Matched + Trimmed samples"),
          nrow=2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).

Note that the number of samples retained (the sum of the counts) is much lower after matching and trimming.

Batch effect corrections learned via cb.correct.matching_cComBat() are reasonably interpretable for samples within a range of covariate overlap across the batches included in the correction, which is more general than a fully matching subset (used for the estimation procedure). This means that it may be reasonable to apply learned corrections to certain out-of-sample data which have similar covariates to the included samples, even if these samples are not part of the fully matched subset used directly for estimation of the batch effect correction model. To do so, we can use the apply.oos argument, which has the effect of learning batch effects from a fully matched subset of points, and then applying the learned batch effects to a less stringent covariate-overlapping subset of samples. This therefore has the effect of learning the batch effect correction from a subset of samples chosen to reduce bias, and then applies it to a wider subset of samples which are similar to the samples used for bias reduction.

cor.sim.oos <- cb.correct.matching_cComBat(sim.simpl$Ys, sim.simpl$Ts, data.frame(Covar=sim.simpl$Xs), 
                                       match.form="Covar", apply.oos=TRUE)
#> Adjusting for1covariate(s) or covariate level(s)

ggarrange(plot.covars(cor.sim.oos$Ts, cor.sim.oos$Xs$Covar, 
                      title="(A) In- and out-of-sample data"),
          plot.sim(cor.sim.oos$Ys.corrected, cor.sim.oos$Ts, cor.sim.oos$Xs$Covar,
                   title="(B) matching cComBat on in- and out-of-sample data"),
          nrow=2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).

This applies matching cComBat to the in-sample points (the fully matching subset across all batches, all of whom together are chosen such that the batches are covariate balanced, or the covariate distributions are rendered approximately equal, across batches) as well as out-of-sample points (data from individual batches who fall within a range of covariate overlap across all batches).

This behavior can be tuned directly via the cb.correct.apply_cComBat() function. We caution individuals from using this function directly without performing steps to ensure that the subset of samples the batch effect correction is learned from and the subset of samples to whom the batch effect correction are applied have similar covariate ranges. The code below computes the points within a range of covariate overlap across all batches using vertex matching.

oos.ids <- cb.align.vm_trim(sim.simpl$Ts, sim.simpl$Xs)
Ys.oos <- sim.simpl$Ys[oos.ids,,drop=FALSE]; Ts.oos <- sim.simpl$Ts[oos.ids]
Xs.oos <- sim.simpl$Xs[oos.ids,,drop=FALSE]

Ys.oos.cor <- cb.correct.apply_cComBat(Ys.oos, Ts.oos, data.frame(Covar=Xs.oos),
                                       cor.sim.oos$Model)

plot.sim(Ys.oos.cor, Ts.oos, Xs.oos, title=" matching cComBat applied to OOS data")