Simulations for Causal Batch/Group Effects

Eric W. Bridgeford

2025-01-07

require(causalBatch)
require(ggplot2)
require(tidyr)
n = 200

In this vignette, we will generate and plot simulations that will be used throughout this package and in the corresponding manuscript(s) for detecting and correction for group-level/batch effects in multi-dimensional scientific data. These simulations will focus on examples with \(n=200\) samples in \(d=2\) dimensions.

We will start by generating some code which will produce exploratory plots for us:

# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="", 
                     xlabel="Covariate",
                     ylabel="Outcome (1st dimension)") {
  data = data.frame(Y1=Ys[,1], Y2=Ys[,2], 
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE), 
                    Covariates=Xs)
  
  data %>%
    ggplot(aes(x=Covariates, y=Y1, color=Group)) +
    geom_point() +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}

# a function for plotting a scatter plot of the data, along with
# lines denoting the expected outcome (per covariate level)
plot.sim.expected <- function(Ys, Ts, Xs, Ytrue, Ttrue, Xtrue, title="",
                           xlabel="Covariate",
                           ylabel="Outcome (1st dimension)") {
  data = data.frame(Y1=Ys[,1], Y2=Ys[,2], 
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE), 
                    Covariates=Xs)
  data.true = data.frame(Y1=Ytrue[,1], Y2=Ytrue[,2], 
                         Group=factor(Ttrue, levels=c(0, 1), ordered=TRUE), 
                         Covariates=Xtrue)
  
  data %>%
    ggplot(aes(x=Covariates, y=Y1, color=Group)) +
    geom_point(alpha=0.5) +
    labs(title=title, x=xlabel, y=ylabel) +
    geom_line(data=data.true, aes(group=Group), linewidth=1.2) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}

Sigmoidal Simulation

In this simulation, we will generate a sigmoidal simulation with \(n=200\) samples, where the samples have an equal probability of being from each group/batch. The n argument controls the number of samples.

sim = cb.sims.sim_sigmoid(n=n)

plot.sim(sim$Ys, sim$Ts, sim$Xs, title="Sigmoidal Simulation")

The default is for the level of covariate overlap, the area under the curve upper-bounded by the minimum density across both groups/batches, is \(1\):

print(sprintf("Covariate Overlap: %.3f", sim$Overlap))
#> [1] "Covariate Overlap: 1.000"

Notice that the outcomes for group/batch \(0\) appear to generally be greater than the outcomes for group/batch \(1\). This is confirmed by looking at the expected signal for each dimension, shown below as a line:

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Sigmoidal Simulation")

When the covariates are balanced, the relationship is rather obvious and easy to differentiate without needing to look at the expected outcomes at all. However, when the covariates become imbalanced, it becomes less obvious. This is controlled by the unbalancedness parameter, and the covariate distributions are the same across groups/batches only when unbalancedness is \(1\). For instance, when we set it to \(3\):

sim = cb.sims.sim_sigmoid(n=n, unbalancedness = 3)

plot.sim(sim$Ys, sim$Ts, sim$Xs, 
         title="Sigmoidal Simulation (Imbalanced Covariates)")

In this case, using only the observed samples, it is much harder to tell that group/batch \(0\) tends to have higher outcomes than group/batch \(1\), despite the fact that the expected signals (for a given covariate level) are the same as the preceding simulation:

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Sigmoidal Simulation")

Note also that the level of covariate overlap has decreased:

print(sprintf("Covariate Overlap: %.3f", sim$Overlap))
#> [1] "Covariate Overlap: 0.125"

If we shift unbalancedness below \(1\), this pattern is reversed: group/batch \(0\) will tend to be to the right of the origin, and group/batch \(1\) to the left of the origin:

sim = cb.sims.sim_sigmoid(n=n, unbalancedness = 1/3)

plot.sim(sim$Ys, sim$Ts, sim$Xs, 
         title="Sigmoidal Simulation (Imbalanced Covariates)")

Linear Simulation

In this simulation, we will generate a linear simulation with \(n=200\) samples, where the samples have an equal probability of being from each group/batch.

sim = cb.sims.sim_linear(n=n)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Linear Simulation")

We can manipulate the ratio of samples in each group, via the \(\pi\) argument, where the probability a sample is from group/batch \(1\) is \(\pi\), and the probability a sample is from group/batch \(0\) is \(1 - \pi\). Notice that decreasing \(\pi\) to \(0.25\) yields more samples in batch \(0\) than batch \(1\):

sim = cb.sims.sim_linear(n=n, pi = 0.25)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Linear Simulation (Imbalanced Groups/Batches)")

and there are relatively few numbers of blue points in group/batch \(1\).

Impulse Simulation

In this simulation, we will generate an impulse simulation with \(n=200\) samples, where the samples have an equal probability of being from each group/batch.

sim = cb.sims.sim_impulse(n=n)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Impulse Simulation")

We can use the nbreaks argument to get finer (nbreaks is larger) or more coarse (nbreaks is smaller) granularity for the expected signal at a given covariate level. For instance, decreasing nbreaks to \(5\) breakpoints gives:

sim = cb.sims.sim_impulse(n=n, nbreaks=5)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Impulse Simulation")

Impulse Simulation (Asymmetric Covariates)

In this simulation, we will generate an impulse simulation with \(n=200\) samples. In the preceding examples, notice that as the level of covariate overlap (controlled by unbalancedness) decreases, the covariate distributions shift (symmetrically) to opposite sides of the origin. For this simulation, that is not the case; for gorup \(0\), the covariate distribution is symmetric about the origin, but for group \(1\), the covariate distribution is asymmetric about the origin (and controlled by unbalancedness). Let’s see first when unbalancedness is \(1\), and the covariate distributions are the same:

sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 1)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Impulse Simulation")

When we shift the unbalancedness higher, notice that the covariate distribution for group/batch \(1\) shifts to the right, but for group/batch \(0\) remains in the same (symmetric about the origin) place:

sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 4)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Impulse Simulation (Asymmetric Covariates)")

This pattern/observation is again reversed when unbalancedness is below 1:

sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 1/5)

plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
                  sim$Ytrue, sim$Ttrue, sim$Xtrue,
                  title="Impulse Simulation (Asymmetric Covariates)")

To fully appreciate the breadth/options for the simulations, be sure to check their individual documentation pages using the help() menu.