One of the main uses of the epigraph and hypograph indices is to measure the “extremality” of a curve with respect to a set of curves and to provide an ordering of the curves from top to bottom or vice versa.
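As a rough illustration of how such an ordering can be obtained, the sketch below computes modified epigraph and hypograph indices for curves observed on a common grid. It assumes the usual proportion-based definitions (the share of curve values lying on or above, or on or below, a given curve). The helper `modified_indices` is hypothetical and is not the ehymet implementation.

```r
# Hypothetical helper, NOT the ehymet implementation.
# `curves` is an n x p matrix: one curve per row, p common grid points.
modified_indices <- function(curves) {
  n <- nrow(curves)
  p <- ncol(curves)
  t(sapply(seq_len(n), function(i) {
    # Repeat curve i on every row so it can be compared with all curves at once
    x <- matrix(curves[i, ], nrow = n, ncol = p, byrow = TRUE)
    c(
      MEI = 1 - sum(curves >= x) / (n * p), # 1 - share of values on or above curve i
      MHI = sum(curves <= x) / (n * p)      # share of values on or below curve i
    )
  }))
}

# Three toy curves: flat lines at heights 1, 2 and 3
toy <- rbind(rep(1, 5), rep(2, 5), rep(3, 5))
idx <- modified_indices(toy)
idx

# Ordering the curves from bottom to top by MHI
bottom_to_top <- order(idx[, "MHI"])
bottom_to_top
```

For these toy curves the lowest line gets the smallest MHI and the highest line the largest, so sorting by the index recovers the bottom-to-top order.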
EHyClus is a methodology for clustering functional data which is based on four steps:
In ehymet, the function that allows us to perform this process is EHyClus.
The EHyClus function is designed for clustering functional datasets. The input data can be either a one-dimensional dataset, given as an \(n \times p\) matrix, or a multidimensional one, given as an \(n \times p \times q\) array. The function first smooths the data, then computes the indices discussed in the first vignette (the epigraph and hypograph indices and their modified versions), and finally applies various clustering algorithms to the resulting dataset of indices. The one-dimensional or multidimensional versions of the indices are used depending on the dimensions of the input data.
It supports multiple clustering methods: hierarchical clustering (“hierarch”), k-means (“kmeans”), kernel k-means (“kkmeans”), and spectral clustering (“spc”). It also allows customization of hierarchical clustering methods, distance metrics, and kernels using parameters like l_dist_hierarch.
For smoothing, it uses B-splines with a specified or automatically selected number of basis functions.
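To make the smoothing idea concrete, here is a minimal sketch of fitting a B-spline basis to a single noisy curve with the splines package that ships with R. It is an independent illustration, not the code EHyClus runs internally; the simulated curve and the choice of 8 basis functions are assumptions made for the example.

```r
library(splines) # part of the standard R distribution

set.seed(1)
grid <- seq(0, 1, length.out = 30)
noisy <- sin(2 * pi * grid) + rnorm(length(grid), sd = 0.2)

# Assumed choice for this example: 8 cubic B-spline basis functions
basis <- bs(grid, df = 8, intercept = TRUE)

# Least-squares fit of the basis coefficients (no extra intercept term)
fit <- lm(noisy ~ basis - 1)
smoothed <- fitted(fit)

# One smoothed value per grid point
length(smoothed)
```

Fewer basis functions give a smoother curve, while more basis functions follow the noisy observations more closely.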
To check the quality of the results, if true labels are provided, the function can validate the clustering results and compute performance metrics such as purity, F-measure, and Rand Index (RI). It also records the time taken by each clustering method, which is useful for assessing the trade-off between the quality of the results and execution time.
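For intuition about two of these metrics, the following sketch computes purity and the Rand Index from their textbook definitions in base R. The helpers `purity` and `rand_index` are hypothetical names written for this example and are not the functions used inside ehymet.

```r
# Hypothetical helpers based on the textbook definitions, not ehymet code.

# Purity: each cluster votes for its majority class
purity <- function(true_labels, clusters) {
  tab <- table(clusters, true_labels)
  sum(apply(tab, 1, max)) / length(true_labels)
}

# Rand Index: share of pairs on which both partitions agree
rand_index <- function(true_labels, clusters) {
  pairs <- combn(length(true_labels), 2)
  same_true <- true_labels[pairs[1, ]] == true_labels[pairs[2, ]]
  same_clus <- clusters[pairs[1, ]] == clusters[pairs[2, ]]
  mean(same_true == same_clus)
}

truth <- c(1, 1, 1, 2, 2, 2)
found <- c(1, 1, 2, 2, 2, 2) # the third observation is misassigned

purity(truth, found)     # 5/6
rand_index(truth, found) # 10/15
```

With one observation misassigned, five of the six observations belong to the majority class of their cluster, and 10 of the 15 pairs are treated consistently by both partitions.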
All the parameters and their functionality can be found in the function documentation, but arguably the most important ones are:
In this subsection, we are going to show how to obtain the results of the EHyClus function and validate them (if true_labels are present). First, we are going to generate multidimensional data using sim_model_ex2:
And, as we are generating \(n = 50\) curves per group and there are two groups, the true labels are:
We can run the algorithm with or without the true labels. For the combinations of variables, we will use “d2dtaMEI” with “d2dtaMHI”.
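# Without true labels: only the clusterings are returned
res <- EHyClus(curves, vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")))

# With true labels: validation metrics are computed as well
res_with_labels <- EHyClus(curves,
  vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")),
  true_labels = true_labels
)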
For the result without labels, it can be seen that we only have the generated clusters:
Whereas the one we generated with labels has both cluster and metrics:
Let’s explore both components of the latter. We’re going to start with cluster.
#>    levelName
#> 1  cluster
#> 2   ¦--hierarch
#> 3   ¦   ¦--hierarch_single_euclidean_d2dtaMEId2dtaMHI
#> 4   ¦   ¦   ¦--valid
#> 5   ¦   ¦   °--internal_metrics
#> 6   ¦   ¦--hierarch_complete_euclidean_d2dtaMEId2dtaMHI
#> 7   ¦   ¦   ¦--valid
#> 8   ¦   ¦   °--internal_metrics
#> 9   ¦   ¦--hierarch_average_euclidean_d2dtaMEId2dtaMHI
#> 10  ¦   ¦   ¦--valid
#> 11  ¦   ¦   °--internal_metrics
#> 12  ¦   ¦--hierarch_centroid_euclidean_d2dtaMEId2dtaMHI
#> 13  ¦   ¦   ¦--valid
#> 14  ¦   ¦   °--internal_metrics
#> 15  ¦   ¦--hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI
#> 16  ¦   ¦   ¦--valid
#> 17  ¦   ¦   °--internal_metrics
#> 18  ¦   ¦--hierarch_single_manhattan_d2dtaMEId2dtaMHI
#> 19  ¦   ¦   ¦--valid
#> 20  ¦   ¦   °--internal_metrics
#> 21  ¦   ¦--hierarch_complete_manhattan_d2dtaMEId2dtaMHI
#> 22  ¦   ¦   ¦--valid
#> 23  ¦   ¦   °--internal_metrics
#> 24  ¦   ¦--hierarch_average_manhattan_d2dtaMEId2dtaMHI
#> 25  ¦   ¦   ¦--valid
#> 26  ¦   ¦   °--internal_metrics
#> 27  ¦   ¦--hierarch_centroid_manhattan_d2dtaMEId2dtaMHI
#> 28  ¦   ¦   ¦--valid
#> 29  ¦   ¦   °--internal_metrics
#> 30  ¦   °--hierarch_ward.D2_manhattan_d2dtaMEId2dtaMHI
#> 31  ¦       ¦--valid
#> 32  ¦       °--internal_metrics
#> 33  ¦--kmeans
#> 34  ¦   ¦--kmeans_euclidean_d2dtaMEId2dtaMHI
#> 35  ¦   ¦   ¦--valid
#> 36  ¦   ¦   °--internal_metrics
#> 37  ¦   °--kmeans_mahalanobis_d2dtaMEId2dtaMHI
#> 38  ¦       ¦--valid
#> 39  ¦       °--internal_metrics
#> 40  ¦--kkmeans
#> 41  ¦   ¦--kkmeans_rbfdot_d2dtaMEId2dtaMHI
#> 42  ¦   ¦   ¦--valid
#> 43  ¦   ¦   °--internal_metrics
#> 44  ¦   °--kkmeans_polydot_d2dtaMEId2dtaMHI
#> 45  ¦       ¦--valid
#> 46  ¦       °--internal_metrics
#> 47  °--spc
#> 48      ¦--spc_rbfdot_d2dtaMEId2dtaMHI
#> 49      ¦   ¦--valid
#> 50      ¦   °--internal_metrics
#> 51      °--spc_polydot_d2dtaMEId2dtaMHI
#> 52          ¦--valid
#> 53          °--internal_metrics
The suffix of each name shows the combination of variables used, that is, “d2dtaMEI” with “d2dtaMHI”. The names also show the parameters used. For example, it is easy to see that k-means has been performed with both the Euclidean and the Mahalanobis distances.
Looking at one of the elements in particular, we see that it contains the following:
str(res_with_labels$cluster$hierarch$hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI)
#> List of 4
#> $ cluster : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
#> $ valid :List of 4
#> ..$ Purity : num 0.93
#> ..$ Fmeasure: num 0.868
#> ..$ RI : num 0.869
#> ..$ ARI : num 0.737
#> $ internal_metrics:List of 4
#> ..$ davies_bouldin: num 1.01
#> ..$ dunn : num 0.06
#> ..$ silhouette : num 0.409
#> ..$ infomax : num 0.993
#> $ time : num 0.000373
Each element contains four components. First we have cluster, which is the vector that assigns each of the curves to a cluster. Then we have valid, which is the validation data, and internal_metrics, which holds internal validation indices (Davies-Bouldin, Dunn, silhouette and infomax). Finally, time is the time that this particular method has taken to be executed. Let’s take a look at valid:
head(res_with_labels$cluster$hierarch$hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI$valid)
#> $Purity
#> [1] 0.93
#>
#> $Fmeasure
#> [1] 0.8678
#>
#> $RI
#> [1] 0.8685
#>
#> $ARI
#> [1] 0.737
It gives us four metrics: Purity, F-measure, the Rand Index (RI) and the Adjusted Rand Index (ARI).
However, we can obtain this information in another way. Going back to res_with_labels, its second element, metrics, gives us a summary of all the metrics:
head(res_with_labels$metrics, 3)
#> Purity Fmeasure RI ARI Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI 1.00 1.0000 1.0000 1.000 0.0032920837
#> kmeans_mahalanobis_d2dtaMEId2dtaMHI 1.00 1.0000 1.0000 1.000 0.0038609505
#> hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI 0.93 0.8678 0.8685 0.737 0.0003728867
It gives us the Purity, F-measure, RI, ARI and Time for every clustering method with every combination of parameters that has been executed. The row for “hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI” shows the same values as the valid element inspected previously. Any other method can be looked up by its row name, for example “hierarch_single_euclidean_d2dtaMEId2dtaMHI”:
res_with_labels$metrics["hierarch_single_euclidean_d2dtaMEId2dtaMHI", ]
#> Purity Fmeasure RI ARI Time
#> hierarch_single_euclidean_d2dtaMEId2dtaMHI 0.52 0.6535 0.4958 8e-04 0.0004339218
Now imagine that we executed EHyClus without the true_labels parameter but still want to compute the metrics. For that purpose, we can use the clustering_validation function, passing it the cluster assignments generated by the clustering method and the true labels:
clustering_validation(res$cluster$hierarch$hierarch_single_euclidean_d2dtaMEId2dtaMHI$cluster, true_labels)
#> $Purity
#> [1] 0.52
#>
#> $Fmeasure
#> [1] 0.6535
#>
#> $RI
#> [1] 0.4958
#>
#> $ARI
#> [1] 8e-04
Note that we have only looked at some of the results. Let’s see if other methods have given us better metrics. Results are sorted based on the RI:
head(res_with_labels$metrics, 5)
#> Purity Fmeasure RI ARI Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI 1.00 1.0000 1.0000 1.0000 0.0032920837
#> kmeans_mahalanobis_d2dtaMEId2dtaMHI 1.00 1.0000 1.0000 1.0000 0.0038609505
#> hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI 0.93 0.8678 0.8685 0.7370 0.0003728867
#> hierarch_ward.D2_manhattan_d2dtaMEId2dtaMHI 0.93 0.8685 0.8685 0.7370 0.0005681515
#> kkmeans_rbfdot_d2dtaMEId2dtaMHI 0.77 0.6684 0.6422 0.2857 0.0494818687
Indeed, we can see that some methods have given us excellent results: both k-means variants recover the true groups perfectly, and the Ward hierarchical methods follow with a purity of 0.93.
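Because the metrics table also records execution time, it can be used to weigh accuracy against speed. The sketch below rebuilds a small data frame from the RI and Time columns printed above and picks the fastest method among those reaching a given RI. The data frame is a hand-made copy for illustration (with real results you would filter res_with_labels$metrics directly), and the 0.85 threshold is an arbitrary assumption.

```r
# Hand-made copy of the RI and Time columns from the printed table
metrics <- data.frame(
  RI = c(1.0000, 1.0000, 0.8685, 0.8685, 0.6422),
  Time = c(0.0032920837, 0.0038609505, 0.0003728867, 0.0005681515, 0.0494818687),
  row.names = c(
    "kmeans_euclidean_d2dtaMEId2dtaMHI",
    "kmeans_mahalanobis_d2dtaMEId2dtaMHI",
    "hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI",
    "hierarch_ward.D2_manhattan_d2dtaMEId2dtaMHI",
    "kkmeans_rbfdot_d2dtaMEId2dtaMHI"
  )
)

# Keep the methods that reach the chosen RI threshold (0.85 is arbitrary)
good <- metrics[metrics$RI >= 0.85, ]

# Among those, pick the one with the shortest execution time
fastest <- rownames(good)[which.min(good$Time)]
fastest
```

With these values the Ward method with Euclidean distance is the fastest of the four methods above the threshold, although the k-means variants achieve a higher RI.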
In addition, if we only want to obtain the results for the best clustering method, we can use the only_best parameter. Note that this parameter only works when true_labels is provided.
res_only_best <- EHyClus(curves,
  vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")),
  true_labels = true_labels,
  only_best = TRUE
)
If we inspect the object, we see that in fact it only contains the results of the clustering method that has obtained the best results.
res_only_best$cluster
#> $kmeans_euclidean_d2dtaMEId2dtaMHI
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$cluster
#> [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
#> [55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$Purity
#> [1] 1
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$Fmeasure
#> [1] 1
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$RI
#> [1] 1
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$ARI
#> [1] 1
#>
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$davies_bouldin
#> [1] 0.9024109
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$dunn
#> [1] 0.07203434
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$silhouette
#> [1] 0.4593925
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$infomax
#> [1] 1
#>
#>
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$time
#> [1] 0.004937172
res_only_best$metrics
#> Purity Fmeasure RI ARI Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI 1 1 1 1 0.004937172
This can be useful if you want to use the function as if it were a typical clustering method.
When vars_combinations = "auto", EHyClus will automatically select an optimal combination of variables through a data-driven approach. This process consists of two main steps:
The first step focuses on identifying variables that show enough variation to be meaningful discriminators. For each variable, the function calculates how many unique values it contains relative to its total number of observations. Only variables where this ratio exceeds 50% are retained. This ensures that we focus on variables that have sufficient variation to be useful in distinguishing between different groups.
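The filtering rule of this first step can be sketched in a few lines of base R. The helper `keep_varying` and the toy column names are hypothetical and are not the ehymet internals; the example only illustrates the “ratio of unique values above 50%” rule described above.

```r
# Hypothetical helper, not the ehymet internals.
# `ind_data`: data frame with one row per curve and one column per index.
keep_varying <- function(ind_data, threshold = 0.5) {
  # Share of distinct values in each column
  ratio <- vapply(
    ind_data,
    function(v) length(unique(v)) / length(v),
    numeric(1)
  )
  # Retain only the columns whose ratio exceeds the threshold
  ind_data[, ratio > threshold, drop = FALSE]
}

toy <- data.frame(
  v1 = c(0.1, 0.2, 0.3, 0.4), # 4 unique / 4 = 1.00: kept
  v2 = c(0.5, 0.5, 0.5, 0.9), # 2 unique / 4 = 0.50: dropped, not above 50%
  v3 = c(0.1, 0.1, 0.2, 0.3)  # 3 unique / 4 = 0.75: kept
)

kept <- names(keep_varying(toy))
kept
```

Note that a ratio of exactly 50% is not enough under a strict “exceeds” rule, which is why the second toy column is dropped.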
The second step deals with redundancy in our selected variables. After all, having multiple variables that essentially contain the same information won’t improve our clustering results. To address this, the function examines how correlated our remaining variables are with each other. If two variables show a correlation higher than 0.75, we keep only one of them. Specifically, we look at how correlated each variable is with all other variables on average, and remove the one that shows higher average correlation. This helps ensure our final set of variables provides unique, non-redundant information for clustering.
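The redundancy filter of this second step can be sketched as follows. The helper `drop_correlated` is hypothetical and is not the ehymet internals. It assumes that correlations are compared in absolute value and that pairs are processed starting from the most correlated one, which are choices made for this example.

```r
# Hypothetical helper, not the ehymet internals.
drop_correlated <- function(ind_data, cutoff = 0.75) {
  repeat {
    if (ncol(ind_data) < 2) break
    cors <- abs(cor(ind_data)) # assumption: absolute correlations
    diag(cors) <- 0
    if (max(cors) <= cutoff) break
    # Most correlated pair still present in the data
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    # Average correlation of each member of the pair with all other variables
    avg <- rowSums(cors)[pair] / (ncol(cors) - 1)
    # Remove the member with the higher average correlation
    ind_data <- ind_data[, -pair[which.max(avg)], drop = FALSE]
  }
  ind_data
}

set.seed(1)
v1 <- rnorm(100)
toy <- data.frame(
  v1 = v1,
  v2 = v1 + rnorm(100, sd = 0.1), # nearly a copy of v1
  v3 = rnorm(100)                 # unrelated to v1 and v2
)

remaining <- names(drop_correlated(toy))
remaining
```

Only one of the two nearly identical variables survives, together with the unrelated one. Which member of the pair is dropped depends on its average correlation with the remaining variables.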
Here’s an example using automatic variable selection:
# Generate sample data
n <- 50
curves <- sim_model_ex2(n = n, i_sim = 4)
true_labels <- c(rep(1, n), rep(2, n))
# Use automatic variable selection
res_auto <- EHyClus(curves,
  vars_combinations = "auto",
  true_labels = true_labels
)
# Look at which variables were selected
attr(res_auto, "vars_combinations")[[1]]
#> [1] "ddtaMEI" "dtaMHI" "ddtaMHI" "d2dtaMHI"