This vignette provides examples of setting up non-spatial and spatial finite mixture models (FMMs) using clustTMB.
The meuse data set from the sp library (Pebesma and Bivand 2005) consists of 155 observations, 4 response variables, location data, and 8 covariates. The response involves four heavy metals measured to monitor pollution levels in the top soil of the Meuse river floodplain, Netherlands (Fig. 1.1, (Rikken and Rijn 1993)). The heavy metal data exhibit normality on the logscale (Fig. 1.2), allowing for a comparison between the Gaussian and lognormal distributions in addition to an FMM with and without spatial structure influencing the cluster probability.
# load data
library(sp)
data("meuse")
data("meuse.riv")
data("meuse.grid")
The simple Gaussian FMM can be fit using the following code:
library(clustTMB)
<- clustTMB(response = meuse[, 3:6], G = 3, covariance.structure = "VVV") mod.gauss
Specifying a lognormal distribution is implemented using the family and link specification:
<- clustTMB(
mod.ln response = meuse[, 3:6],
family = lognormal(link = "identity"),
G = 3, covariance.structure = "VVV"
)
The inclusion of spatial random effects in the expectation of the cluster probability in clustTMB depends on the SPDE-FEM approximation to a spatial Gaussian Markov Random Field (GMRF) introduced by the package, R-INLA. See Spatial GMRF and Gating Network for details on these clustTMB formulations. The fmesher R package (github source code) is used to run functions needed to implement this SPDE-FEM method.
The spatial model in clustTMB first requires the definition of the spatial mesh. This mesh defines the discretization of the continuous space using a constrained Delaunay triangulation. For details on building the spatial mesh, see Krainski et al., 2021, Sec 2.6.
As an example, the spatial mesh for the meuse data is built using the fmesher functions, fm_nonconvex_hull() to set up the boundary and fm_mesh_2d() to generate the spatial mesh:
library(fmesher)
<- meuse[, 1:2]
loc <- fmesher::fm_nonconvex_hull(as.matrix(loc), convex = 200)
Bnd <- fmesher::fm_mesh_2d(as.matrix(loc),
meuse.mesh max.edge = c(300, 1000),
boundary = Bnd
)
The inlabru R package can be used to visualize the mesh:
ggplot() +
::gg(meuse.mesh) +
inlabrugeom_point(mapping = aes(x = loc[, 1], y = loc[, 2], size = 0.5), size = 0.5) +
theme_classic()
Coordinates are converted to a spatial point data frame and read into the clustTMB model, along with the mesh, using the spatial.list argument. Spatial projections can be generated by defining a spatial data frame of prediction coordinates. These can be passed into the model via the projection.dat argument of clustTMB. The gating formula is specified using the gmrf() command:
# convert coordinates to a spatial point data frame
<- sf::st_as_sf(loc, coords = c("x", "y"))
Loc
# define spatial prediction coordinates
data("meuse.grid")
<- sf::st_as_sf(meuse.grid, coords = c("x", "y"))
Meuse.Grid <- clustTMB(
mod.ln.sp response = meuse[, 3:6],
family = lognormal(link = "identity"),
gatingformula = ~ gmrf(0 + 1 | loc),
G = 4, covariance.structure = "VVV",
spatial.list = list(loc = Loc, mesh = meuse.mesh),
projection.dat = Meuse.Grid
)
## intercept removed from gatingformula
## when random effects specified
Results can be viewed via model output:
# Estimated fixed parameters
summary(mod.ln.sp$sdr, "fixed")
## Estimate Std. Error
## betag 0.1810561 0.51706191
## betag 0.5594793 0.51158197
## betag 0.1898442 0.50645553
## betad 2.0157745 0.09100872
## betad 4.3160880 0.03898693
## betad 5.4259819 0.08790516
## betad 6.7095831 0.08920048
## betad 1.0164082 0.03989472
## betad 3.6119034 0.03016282
## betad 5.2215817 0.04708007
## betad 6.2274867 0.04395365
## betad 0.1353846 0.06134522
## betad 3.1482198 0.02002521
## betad 4.2137115 0.04068754
## betad 5.2614025 0.03329950
## betad -1.4361504 0.05510456
## betad 3.1132966 0.03560916
## betad 4.2118588 0.04706044
## betad 5.1996568 0.04976672
## theta -1.2100810 0.16013190
## theta -2.9055386 0.19071087
## theta -1.2794746 0.15756893
## theta -1.2502187 0.13950082
## theta -2.5718215 0.17343496
## theta -3.1310896 0.14034774
## theta -2.2406099 0.13544250
## theta -2.3780380 0.13367632
## theta -1.8212760 0.17220439
## theta -4.0603269 0.19161568
## theta -2.6424666 0.14302620
## theta -3.0432260 0.15385305
## theta -2.4648411 0.25661400
## theta -3.3381004 0.23942344
## theta -2.7804404 0.20016915
## theta -2.6686130 0.19753614
## ln_kappag -5.9346215 0.28269502
# Minimum negative log likelihood
$opt$objective mod.ln.sp
## [1] 2318.922
A cluster analysis was run on the meuse dataset using the Gaussian and lognormal family with and without spatial random effects in the gating network for clusters ranging from 2 to 10. BIC scores favored the lognormal spatial model with 4 clusters (Table 2.1). This model resulted in fewer clusters compared to the spatial Gaussian model.
family | space | clusters | BIC |
---|---|---|---|
lognormal | 1 | 4 | 4805 |
lognormal | 0 | 4 | 4861 |
Gaussian | 1 | 6 | 4861 |
Gaussian | 0 | 3 | 4910 |
Results from the optimal model suggested a spatial pattern where the highest ppm observations for all four metals were clustered together (Cluster 0) in a narrow strip along the bank of the Meuse River within the northwest corner of the floodplain (Fig. 2.1, Fig. 2.2). A separate cluster (Cluster 3) was characterized by low ppm values for all metals and was spatially distributed in the central region of the floodplain. The other two clusters were characterized by moderately low (Cluster 2) and moderately high (Cluster 1) ppm values for all metals. Spatial predictions of clustered heavy metals in the Meuse River floodplain (Fig. 2.2) can aid in risk assessment and environmental mitigation measures after flood events.
clustTMB fits spatial random effects using a Gaussian Markov Random Field (GMRF). The precision matrix, \(Q\), of the GMRF is the inverse of a Matern covariance function and takes two parameters: 1) \(\kappa\), which is the spatial decay parameter and a scaled function of the spatial range, \(\phi = \sqrt{8}/\kappa\), the distance at which two locations are considered independent; and 2) \(\tau\), which is a function of \(\kappa\) and the marginal spatial variance \(\sigma^{2}\):
\[\tau = \frac{1}{2\sqrt{\pi}\kappa\sigma}.\] The precision matrix is approximated following the SPDE-FEM approach, where a constrained Delaunay triangulation network is used to discretize the spatial extent in order to determine a GMRF for a set of irregularly spaced locations, \(i\). For details on the SPDE-FEM approach, see Krainski et al., 2021, Sec. 2.2
\[\omega_{i} \sim GMRF(Q[\kappa, \tau])\]
When random effects, \(\mathbb{u}\), are specified in the gating network, the probability of cluster membership \(\pi_{i,g}\) for observation \(i\) is fit using multinomial regression:
\[ \begin{align} \mathbb{\eta}_{,g} &= X\mathbb{\beta}_{,g} + \mathbb{u}_{,g} \\ \mathbb{\pi}_{,g} &= \frac{ exp(\mathbb{\eta}_{,g})}{\sum^{G}_{g=1}exp(\mathbb{\eta}_{,g})} \end{align} \]