Where

The where package has one main function, run(), which provides a clean syntax for vectorising code that relies on NSE (non-standard evaluation), for example in ggplot2, dplyr, or data.table. Two infix wrappers, %where% and %for%, provide arguably cleaner syntax. A typical example might look like:

subgroups <- .(all        = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)

(iris %>%
  filter(x) %>%
  summarise(across(Sepal.Length:Petal.Width,
                   mean),
            .by = Species)) %for% subgroups

Here we have a population dataset and various subpopulations of interest, and we want to apply the same code to every subpopulation. If the subpopulations form a partition of the data (for example, a census population divided into 5-year age bands), we can use group_by() in dplyr or faceting in ggplot2 to apply the same code to each. In general, however, subpopulations are not so easy to iterate over, for example when some are defined by age, others by gender, and others by a combination of the two. Even a single variable that allows multiple options to be selected (for example, ethnicity in the New Zealand Census) defines subpopulations (here, ethnic groups) that overlap, so they cannot be vectorised over with the partitioning functionality (such as group_by() and faceting) in standard packages. The where package makes these cases straightforward.
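To make the distinction concrete, the following base R sketch compares a partition by Species with the overlapping subgroups used in the example above. It does not use the where package, and the names partition_sizes, masks and mask_sizes are illustrative only.

```r
# A partition: every row falls in exactly one group.
partition_sizes <- table(iris$Species)
sum(partition_sizes) == nrow(iris)   # group sizes add up to the number of rows

# Overlapping subgroups: a row can satisfy several conditions at once.
masks <- list(
  all        = rep(TRUE, nrow(iris)),
  long_sepal = iris$Sepal.Length > 6,
  long_petal = iris$Petal.Length > 5.5
)
mask_sizes <- vapply(masks, sum, integer(1))
sum(mask_sizes) > nrow(iris)         # rows are counted more than once
```

Because the subgroup sizes sum to more than the number of rows, no single grouping column can represent them.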

Simple example

As a running example we will use the iris dataset and the following (largely unnatural) subpopulations of irises: all irises, those with a sepal length greater than 6, and those with a petal length greater than 5.5.

The filter conditions that define these subgroups can be captured with the .() function:

subgroups <- .(all        = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)
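As a rough illustration of what capturing conditions means, the base R sketch below defines a hypothetical helper, capture_conditions(), that stores its arguments as unevaluated expressions. It is an assumption made for illustration and is not the where package's implementation of .().

```r
# Hypothetical stand-in for a capturing function (not the package's .()).
capture_conditions <- function(...) {
  # substitute(list(...)) returns the call list(<args>) unevaluated;
  # dropping the first element (the symbol `list`) leaves the arguments.
  as.list(substitute(list(...)))[-1]
}

captured <- capture_conditions(all        = TRUE,
                               long_sepal = Sepal.Length > 6)

names(captured)          # the labels given to each condition
captured$long_sepal      # an unevaluated call, not a logical vector
```

Nothing is evaluated at capture time, which is why Sepal.Length does not need to exist as a variable when the conditions are defined.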

Using these subgroups directly in standard R is tricky. For example, we could form the separate populations with repeated code:

# With base R
iris
iris[iris[["Sepal.Length"]] > 6, ] # or with(iris, iris[Sepal.Length > 6, ])
iris[iris[["Petal.Length"]] > 5.5, ] # or with(iris, iris[Petal.Length > 5.5, ])

# With dplyr
library(dplyr)
iris
filter(iris, Sepal.Length > 6)
filter(iris, Petal.Length > 5.5)

# With data.table
library(data.table)
as.data.table(iris)
as.data.table(iris)[Sepal.Length > 6]
as.data.table(iris)[Petal.Length > 5.5]

or this could be done by first explicitly capturing expressions (as done above with .) and then evaluating them:

lapply(subgroups, function(group) with(iris, iris[eval(group), ]))
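An alternative to evaluating each condition inside with() is to substitute it into a template expression first and then evaluate the completed call. The base R sketch below shows that variant. It builds the conditions with quote() so that it runs without the where package, and the names conditions, template and subsets are illustrative.

```r
# Conditions stored as unevaluated expressions.
conditions <- list(all        = TRUE,
                   long_sepal = quote(Sepal.Length > 6),
                   long_petal = quote(Petal.Length > 5.5))

# Template containing a placeholder symbol, `subgroup`.
template <- quote(with(iris, iris[subgroup, ]))

subsets <- lapply(conditions, function(condition) {
  # Replace the placeholder with the condition, then evaluate the result,
  # e.g. with(iris, iris[Sepal.Length > 6, ]).
  filled <- do.call(substitute, list(template, list(subgroup = condition)))
  eval(filled)
})

vapply(subsets, nrow, integer(1))   # number of rows in each subset
```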

This requires some comfort with managing expressions in R and can quickly get messy with more complex queries, particularly if we want to apply across more than one set of expressions. The run() function hides these manipulations:

run(with(iris, iris[subgroup, ]),
    subgroup = subgroups)

# or
with(iris, iris[x, ]) %for% subgroups

More interesting examples

A standard group by and summarise operation:

library(dplyr)

subgroups = .(all        = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5)
functions = .(mean, sum, prod)

run(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width,
                     summary),
              .by = Species),
  subgroup = subgroups,
  summary  = functions
)
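For comparison, here is a sketch of how two sets of expressions might be crossed by hand in base R, using nested lapply() calls and substitute(). It does not use the where package, it replaces the dplyr pipeline with a simpler sapply() over the four measurement columns, and the nested-list layout of results is a choice made for this sketch rather than a statement about what run() returns.

```r
conditions <- list(all        = TRUE,
                   long_sepal = quote(Sepal.Length > 6))
summaries  <- list(mean = quote(mean),
                   sum  = quote(sum))

# Template with two placeholders: `subgroup` and `summary`.
template <- quote(with(iris, sapply(iris[subgroup, 1:4], summary)))

results <- lapply(conditions, function(condition) {
  lapply(summaries, function(fun) {
    # Fill in both placeholders, then evaluate the completed expression.
    filled <- do.call(substitute,
                      list(template, list(subgroup = condition, summary = fun)))
    eval(filled)
  })
})

results$long_sepal$mean   # column means for irises with Sepal.Length > 6
```

Each additional set of expressions adds another level of nesting, which is the bookkeeping that run() is meant to hide.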

The same using data.table:

library(data.table)
df <- as.data.table(iris)

run(df[subgroup, lapply(.SD, functions), keyby = "Species",
       .SDcols = Sepal.Length:Petal.Width],
    subgroup  = subgroups,
    functions = functions)

Producing the same ggplot over the different populations:

library(ggplot2)

plots <- run(
  ggplot(filter(iris, subgroup),
         aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal(),
  subgroup = subgroups
)

Map(function(plot, name) plot + ggtitle(name), plots, names(plots))

Or different plots for the full population:

run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), 
           geom_smooth())
)

A limitation

A natural extension of the previous example can fail in a non-obvious way, because expressions may be evaluated differently than intended. For example, the following does not work:

# Fails
run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), 
           geom_smooth(), 
           geom_quantile() + geom_rug())
)

since, for the third plot, it tries to evaluate

# Fails
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
    (geom_quantile() + geom_rug()) + 
    theme_minimal()

and geom_quantile() + geom_rug() throws an error, because ggplot2 cannot add two layers to each other without a plot on the left-hand side. This particular use case can be handled by putting the separate geoms in a list:

run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point  = geom_point(), 
           smooth = geom_smooth(), 
           quantilerug = list(geom_quantile(), 
                              geom_rug()))
)

# or by separating out the combined geoms as a function (also using a list)
geom_quantilerug <- function() list(geom_quantile(), 
                                    geom_rug())

run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point  = geom_point(), 
           smooth = geom_smooth(), 
           quantilerug = geom_quantilerug())
)
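The difference between the failing and the working form can be checked directly, without run(). The sketch below assumes ggplot2 is installed and uses geom_point() and geom_rug() as stand-ins for the geoms above. The names layer_sum and p are illustrative.

```r
library(ggplot2)

# Adding two layers to each other fails: there is no plot on the left-hand side.
layer_sum <- tryCatch(geom_point() + geom_rug(),
                      error = function(e) e)
inherits(layer_sum, "error")

# A list of layers can be added to a plot in one step.
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  list(geom_point(), geom_rug()) +
  theme_minimal()
length(p$layers)   # one entry per layer in the list
```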

run in a function

We can call run() from within a function to further hide details. For example, we could produce subpopulation summaries for the different species of iris:

population_summaries <- function(df) run(with(df, df[subgroup, ]),
                                            subgroup = subgroups)

as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]

As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups.

on_subpopulations <- function(expr,
                              populations = subgroups)
  eval(substitute(run(expr, subgroup = populations),
                  list(expr = substitute(expr))))

on_subpopulations(as.data.table(iris)[subgroup])

on_subpopulations(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width,
                     mean),
              .by = Species)
)

on_subpopulations(
  ggplot(filter(iris, subgroup),
         aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
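The eval(substitute(...)) idiom in on_subpopulations() is needed because a wrapper that simply passes expr along hands the inner function the symbol expr rather than the user's code. The base R sketch below shows the difference with three hypothetical helpers (show_expr, naive_wrapper and forwarding_wrapper), none of which are part of the where package.

```r
# An NSE function: reports the code it was given instead of evaluating it.
show_expr <- function(expr) deparse(substitute(expr))

# Passing the argument straight through loses the user's expression.
naive_wrapper <- function(expr) show_expr(expr)

# Splicing the captured expression into the inner call preserves it.
forwarding_wrapper <- function(expr) {
  eval(substitute(show_expr(expr),
                  list(expr = substitute(expr))))
}

naive_wrapper(a + b)        # "expr"
forwarding_wrapper(a + b)   # "a + b"
```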

As with the DRY (Don’t Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing subgroups:

subgroups = .(all        = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5,
              versicolor = Species == "versicolor")

Taking things to the absurd, we can also isolate out the analysis code:

analyses <- .(subset    = as.data.table(iris)[subgroup],
              summarise = iris %>%
                filter(subgroup) %>%
                summarise(across(Sepal.Length:Petal.Width,
                                 mean),
                          .by = Species),
              plot      = ggplot(filter(iris, subgroup),
                                 aes(Sepal.Length, Sepal.Width)) +
                geom_point() +
                theme_minimal())

lapply(analyses,
       function(expr) do.call("on_subpopulations", list(expr)))

A small warning

The ggplot example

on_subpopulations(
  ggplot(filter(iris, subgroup),
         aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)

does not give results identical to running the ggplot code directly for each subgroup, since the ggplot object stores the execution environment, which will be different.

If important, this can be remedied by capturing and passing the calling environment in the on_subpopulations() function:

on_subpopulations <- function(expr,
                              populations = subgroups) {
  e <- parent.frame()
  eval(substitute(run(expr, subgroup = populations, e = e),
                  list(expr = substitute(expr))))
}
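The role of parent.frame() can be seen without ggplot2. In this base R sketch, two hypothetical helpers (eval_locally and eval_in_caller, not part of the where package) evaluate the same expression in different environments, and only the one that uses the caller's frame can see a variable defined inside the calling function.

```r
# Evaluates the expression in this function's own frame.
eval_locally <- function(expr) eval(substitute(expr))

# Evaluates the expression where the function was called from.
eval_in_caller <- function(expr) {
  e <- parent.frame()
  eval(substitute(expr), e)
}

analysis <- function() {
  sepal_threshold <- 6   # exists only inside analysis()
  list(
    local  = tryCatch(eval_locally(sepal_threshold),
                      error = function(e) "not found"),
    caller = eval_in_caller(sepal_threshold)
  )
}

analysis()
```

The local element is "not found" (assuming no global variable named sepal_threshold exists), while caller holds the value 6.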

Infix notation

As some syntactic sugar, there are also two infix versions of run:

as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %where% 
  list(subgroup = subgroups[1:3],
       summary  = functions)

# note: `subgroup` is replaced with `x`
as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %for% 
  subgroups

Complex expressions (for example, with pipes or +) need to be wrapped in “()” or “{}”. For example:

(iris %>%
    filter(x) %>%
    summarise(across(Sepal.Length:Petal.Width,
                     mean),
              .by = Species)) %for% subgroups
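For +, the need for wrapping follows from R's operator precedence, in which %op% operators bind more tightly than +. The sketch below inspects how R parses the two forms. The symbols a, b and groups are placeholders, and quote() only parses, so the where package is not required.

```r
# quote() parses without evaluating, so %for% does not need to be defined.
unwrapped <- quote(a + b %for% groups)
wrapped   <- quote((a + b) %for% groups)

unwrapped[[1]]   # `+`: only `b` is the left operand of %for%
wrapped[[1]]     # `%for%`: the whole sum is the left operand
```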

An additional %with% function provides a similar syntax to %where% for standard evaluation:

(a + b) %with% {
  a = 1
  b = 2
}
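One way an operator with this syntax could be written in base R is sketched below. The operator %with_sketch% is hypothetical and is not the package's implementation of %with%. It assumes the right-hand side is a block of assignments, which it runs in a fresh environment before evaluating the left-hand side there.

```r
# Hypothetical operator, for illustration only (not the package's %with%).
`%with_sketch%` <- function(expr, definitions) {
  env <- new.env(parent = parent.frame())
  eval(substitute(definitions), env)   # run the assignments in a fresh environment
  eval(substitute(expr), env)          # then evaluate the expression there
}

result <- (a + b) %with_sketch% {
  a = 1
  b = 2
}
result   # 3
```

Because the assignments happen in a new environment, a and b do not leak into the calling environment in this sketch.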