The where package has one main function, run(), that provides a clean syntax for vectorising the use of NSE (non-standard evaluation), for example in ggplot2, dplyr, or data.table. There are also two (infix) wrappers, %where% and %for%, that provide arguably cleaner syntax. A typical example might look like
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)

(iris %>%
   filter(x) %>%
   summarise(across(Sepal.Length:Petal.Width,
                    mean),
             .by = Species)) %for% subgroups
Here we have a population dataset and various subpopulations of interest, and we want to apply the same code over all subpopulations. If the subpopulations were a partition of the data (for example, a census population could be divided into 5 year age bands), then we could use group_by() in dplyr or faceting in ggplot2, for example, to apply the same code over all subpopulations. In general, however, the subpopulations will not be so easy to iterate over, for example if some are defined by age, others by gender, and others by a combination of the two. Even a single variable that allows multiple options to be selected (for example, ethnicity in the New Zealand Census) can define subpopulations (in this case ethnic groups) that overlap, and so cannot be vectorised over with the partitioning functionality (like grouping and faceting) in standard packages. The where package makes these examples straightforward.
As a running example we will use the iris dataset and the following (largely unnatural) sub-populations of irises: all irises, irises with a sepal length greater than 6, and irises with a petal length greater than 5.5.
The .() function captures the filter conditions used to define these populations:
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)
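To make the overlap between these subgroups concrete, here is a small base R sketch. It does not use where, and the name membership is introduced here only for illustration. It tabulates how many of the three conditions each row satisfies; rows that satisfy more than one are the reason a single grouping variable cannot represent these subgroups.

```r
# Illustration only: evaluate each condition against iris with base R
membership <- with(iris, cbind(
  all        = rep(TRUE, nrow(iris)),
  long_sepal = Sepal.Length > 6,
  long_petal = Petal.Length > 5.5
))

# Number of subgroups each row belongs to; a partition would give exactly 1
table(rowSums(membership))
```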
To utilise these subgroups directly with standard R is tricky. For example we could form the separate populations with repeated code.
# With base R
iris
iris[iris[["Sepal.Length"]] > 6, ]   # or with(iris, iris[Sepal.Length > 6, ])
iris[iris[["Petal.Length"]] > 5.5, ] # or with(iris, iris[Petal.Length > 5.5, ])

# With dplyr
iris
filter(iris, Sepal.Length > 6)
filter(iris, Petal.Length > 5.5)

# With data.table
iris
as.data.table(iris)[Sepal.Length > 6]
as.data.table(iris)[Petal.Length > 5.5]
or this could be done by first explicitly capturing expressions (as done above with .()) and then evaluating them:
lapply(subgroups, function(group) with(iris, iris[eval(group), ]))
This requires some comfort with managing expressions in R and can quickly get messy with more complex queries, particularly if we want to apply across more than one set of expressions. The run() function hides these manipulations:
run(with(iris, iris[subgroup, ]),
subgroup = subgroups)
# or
with(iris, iris[x, ]) %for% subgroups
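For readers who want to see the kind of manipulation being hidden, the following base R sketch substitutes each condition into a template expression and evaluates the result. It is an illustration under assumptions, not necessarily how run() is implemented: quote() stands in for .(), and the names conditions, template and results are hypothetical.

```r
# Hypothetical stand-ins for the captured conditions, built with base R only
conditions <- list(
  all        = quote(TRUE),
  long_sepal = quote(Sepal.Length > 6),
  long_petal = quote(Petal.Length > 5.5)
)

# The template expression, with `subgroup` as a placeholder symbol
template <- quote(with(iris, iris[subgroup, ]))

# Substitute each condition for the placeholder, then evaluate the result
results <- lapply(conditions, function(cond) {
  filled <- do.call(substitute, list(template, list(subgroup = cond)))
  eval(filled)
})

# One data frame per named condition
sapply(results, nrow)
```

The do.call(substitute, ...) idiom is used because substitute() does not evaluate its first argument, so the stored template has to be spliced into the call.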
A standard group by and summarise operation:
library(dplyr)

subgroups = .(all = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5)

functions = .(mean, sum, prod)

run(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width,
                     summary),
              .by = Species),
  subgroup = subgroups,
  summary = functions
)
The same using data.table:
library(data.table)

df <- as.data.table(iris)

run(df[subgroup, lapply(.SD, functions), keyby = "Species",
       .SDcols = Sepal.Length:Petal.Width],
    subgroup = subgroups,
    functions = functions)
Producing the same ggplot over the different populations:
library(ggplot2)

plots <- run(
  ggplot(filter(iris, subgroup),
         aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal(),
  subgroup = subgroups
)

Map(function(plot, name) plot + ggtitle(name), plots, names(plots))
Or different plots for the full population:
run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(),
           geom_smooth())
)
A natural extension of the previous example can fail in a non-obvious way, because expressions are evaluated differently than might be intended. For example, the following does not work:
# Fails
run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(),
           geom_smooth(),
           geom_quantile() + geom_rug())
)
since, for the third plot, it tries to evaluate
# Fails
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  (geom_quantile() + geom_rug()) +
  theme_minimal()
and geom_quantile() + geom_rug() throws an error. This particular use case can be accomplished by putting the separate geoms in a list:
run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = list(geom_quantile(),
                              geom_rug()))
)

# or by separating out the combined geoms as a function (also using a list)
geom_quantilerug <- function() list(geom_quantile(),
                                    geom_rug())

run(
  ggplot(iris,
         aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = geom_quantilerug())
)
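The failure above comes from a substituted expression being inserted as a single unit rather than as pasted text. A minimal base R sketch of the same effect, using subtraction so that the grouping changes the numeric result (the names template and filled are hypothetical, and this illustrates expression substitution in general rather than the internals of run()):

```r
# Template with a placeholder symbol `layer`, analogous to `plot` above
template <- quote(10 - layer - 1)

# Substitute a compound expression for the placeholder
filled <- do.call(substitute, list(template, list(layer = quote(3 - 2))))

# The substituted call stays together as one operand: 10 - (3 - 2) - 1
eval(filled)                 # 8

# Pasting the text in place would instead give 10 - 3 - 2 - 1
eval(quote(10 - 3 - 2 - 1))  # 4
```

In the ggplot2 case the grouped operand is geom_quantile() + geom_rug(), which is evaluated on its own before being added to the plot.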
We can call run() from within a function to further hide details. For example, we could produce subpopulation summaries for the different species of iris:
population_summaries <- function(df) run(with(df, df[subgroup, ]),
                                         subgroup = subgroups)

as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]
As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups.
on_subpopulations <- function(expr,
                              populations = subgroups)
  eval(substitute(run(expr, subgroup = populations),
                  list(expr = substitute(expr))))

on_subpopulations(as.data.table(iris)[subgroup])

on_subpopulations(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width,
                     mean),
              .by = Species)
)

on_subpopulations(
  ggplot(filter(iris, subgroup),
         aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
As when following the DRY (Don’t Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing the subgroups:
subgroups = .(all = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5,
              versicolor = Species == "versicolor")
Taking things to the absurd, we can also isolate out the analysis code:
analyses <- .(subset = as.data.table(iris)[subgroup],
              summarise = iris %>%
                filter(subgroup) %>%
                summarise(across(Sepal.Length:Petal.Width,
                                 mean),
                          .by = Species),
              plot = ggplot(filter(iris, subgroup),
                            aes(Sepal.Length, Sepal.Width)) +
                geom_point() +
                theme_minimal())

lapply(analyses,
       function(expr) do.call("on_subpopulations", list(expr)))
The ggplot example
on_subpopulations(
ggplot(filter(iris, subgroup),
aes(Sepal.Length, Sepal.Width)) +
geom_point() +
theme_minimal()
)
does not give identical results to executing the ggplot code with the given subgroups, since the ggplot object stores the execution environment, which will be different.
If important, this can be remedied by capturing and passing the calling environment in the on_subpopulations() function:
on_subpopulations <- function(expr,
                              populations = subgroups) {
  e <- parent.frame()
  eval(substitute(run(expr, subgroup = populations, e = e),
                  list(expr = substitute(expr))))
}
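The role of parent.frame() here can be seen in a small base R sketch that is independent of where. The two helper functions and the variable label are hypothetical names used only for illustration; they differ only in which environment the captured expression is evaluated in.

```r
# Evaluates the captured expression inside the function's own frame
eval_inside <- function(expr) {
  label <- "function frame"
  eval(substitute(expr))
}

# Evaluates the captured expression in the caller's frame instead
eval_in_caller <- function(expr) {
  label <- "function frame"
  e <- parent.frame()
  eval(substitute(expr), e)
}

label <- "calling frame"
eval_inside(label)     # "function frame"
eval_in_caller(label)  # "calling frame"
```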
As some syntactic sugar, there are also two infix versions of run:
%where% is a full infix version of run taking the expression as the left argument and a named list of values to be substituted as the right argument.
%for% has slightly simplified syntax but only allows one substitution, for the symbol x.
as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %where%
  list(subgroup = subgroups[1:3],
       summary = functions)

# note `subgroup` replaced with 'x'
as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %for%
  subgroups
Complex expressions (for example, with pipes or +) need to be wrapped with “()” or “{}”. For example
(iris %>%
   filter(x) %>%
   summarise(across(Sepal.Length:Petal.Width,
                    mean),
             .by = Species)) %for% subgroups
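For expressions built with +, the need for wrapping follows from R's operator precedence: user-defined %...% operators bind more tightly than +. The sketch below shows this with a hypothetical operator %op% that simply returns its two arguments unevaluated; it is not part of where.

```r
# A hypothetical infix operator that returns its two operands unevaluated
`%op%` <- function(lhs, rhs) list(lhs = substitute(lhs), rhs = substitute(rhs))

# Without wrapping, only `b` reaches the operator: parsed as a + (b %op% c)
unwrapped <- quote(a + b %op% c)
as.character(unwrapped[[1]])  # "+"

# With wrapping, the whole sum is the left argument
wrapped <- (a + b) %op% c
wrapped$lhs
```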
An additional %with% function provides a similar syntax to %where% for standard evaluation:
(a + b) %with% {
  a = 1
  b = 2
}
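As a rough picture of what this syntax expresses, the sketch below defines a hypothetical operator, %with_sketch%, in base R. It is an assumption about the general idea (evaluate the definitions in a fresh environment, then evaluate the left-hand expression there) and is not the package's implementation.

```r
# Hypothetical operator, for illustration only
`%with_sketch%` <- function(expr, defs) {
  # Fresh environment whose parent is the caller, so outside names still resolve
  env <- new.env(parent = parent.frame())
  # Run the definitions block so that `a` and `b` are created in `env`
  eval(substitute(defs), env)
  # Evaluate the left-hand expression using those definitions
  eval(substitute(expr), env)
}

result <- (a + b) %with_sketch% {
  a = 1
  b = 2
}
result  # 3
```

Because the definitions live in a temporary environment, a and b are not left behind in the calling environment.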