Benchmarking

akc classifies keywords automatically through community detection, and different clustering algorithms can be used for this step. This vignette benchmarks several of these algorithms on networks of different sizes.

First, we’ll load the needed packages.

library(akc)
library(dplyr)
library(stringr)  # provides str_extract(), used below

Then, we prepare the needed data. The built-in data table bibli_data_table is used here.

bibli_data_table %>% 
  keyword_clean() %>% 
  keyword_merge() -> clean_data

Next, we define the combinations of network size and community detection algorithm to be tested:

# three target network sizes, one per row group in the results table
c(100, 200, 300) -> topn_sample
ls("package:akc") %>% 
  str_extract("^group.+") %>% 
  na.omit() %>% 
  setdiff(c("group_biconnected_component",
            "group_components",
            "group_optimal")) -> com_detect_fun_list
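
The selection step above can be tried without akc installed. The following is a minimal sketch, assuming a made-up vector of exported names (`exported` is hypothetical and only stands in for the output of `ls("package:akc")`), and using base R's `grep()` in place of `str_extract()` followed by `na.omit()`.

```r
# Hypothetical export list, for illustration only; the real list
# comes from ls("package:akc").
exported <- c("keyword_clean", "group_louvain", "group_optimal",
              "group_components", "group_fast_greedy", "keyword_group")

# Keep the names that start with "group", which is what
# str_extract("^group.+") followed by na.omit() achieves.
candidates <- grep("^group.+", exported, value = TRUE)

# Drop the functions that are excluded from the benchmark.
excluded <- c("group_biconnected_component",
              "group_components",
              "group_optimal")
com_detect_fun_list_demo <- setdiff(candidates, excluded)

print(com_detect_fun_list_demo)
```

The result is a character vector of function names, not the functions themselves, which is why the benchmark loop later resolves each name with `get()`.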

Finally, we’ll implement the computation and record the results.

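all = tibble()  # accumulates one row per (algorithm, network size) run
for(i in com_detect_fun_list){
    for(j in topn_sample){
      # time the grouping step for algorithm i on the top-j keyword network
      system.time({
        clean_data %>% 
          keyword_group(top = j,com_detect_fun = get(i)) %>% 
          as_tibble() -> grouped_network_table
      }) %>% na.omit() -> time_info
      # network size and number of detected groups
      grouped_network_table %>% nrow() -> node_no
      grouped_network_table %>% distinct(group) %>% nrow() -> group_no
      # mean of the group sizes
      grouped_network_table %>% 
        count(group) %>% 
        summarise(mean(n)) %>% 
        .[[1]] -> group_avg_node_no
      # standard deviation of the group sizes (NA when there is only one group)
      grouped_network_table %>% 
        count(group) %>% 
        summarise(sd(n)) %>% 
        .[[1]] -> group_sd_node_no
      # append this run as a new row; values are stored as character here
      c(com_detect_fun = i, 
        topn = j,
        node_no = node_no,group_no = group_no,
        avg = group_avg_node_no,
        sd = group_sd_node_no,time_info[1:3]) %>% 
        bind_rows(all,.) -> all
    }
}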

# convert the measurements to numeric, keep one row per algorithm and network size,
# and drop topn and the *.self timings so that only the elapsed time remains
res = all %>% 
  mutate_at(2:9,function(x) as.numeric(x) %>% round(2)) %>% 
  distinct(com_detect_fun,node_no,.keep_all = TRUE) %>% 
  select(-topn,-contains("self")) %>% 
  setNames(c("com_detect_fun","No. of total nodes","No. of total groups",
             "Average node number in each group","Standard deviation of node number",
             "Computer running time for keyword_group function")) 
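
The timing step inside the loop can be sketched in isolation with base R. This is a minimal sketch, not the vignette's own code: `time_by_name()` is a hypothetical helper, and `sort()` stands in for a community detection function so that the example runs without akc.

```r
# Hypothetical helper: look up a function by its name (as the loop does
# with get(i)), run it, and report the elapsed wall-clock time in seconds.
time_by_name <- function(fun_name, ...) {
  f <- get(fun_name, mode = "function")
  # The assignment inside system.time() is evaluated in this function's
  # frame, so `result` is available afterwards.
  timing <- system.time(result <- f(...))
  list(result = result, elapsed = unname(timing["elapsed"]))
}

# sort() is only a stand-in for an expensive grouping function.
out <- time_by_name("sort", c(3, 1, 2))
print(out$result)
print(out$elapsed)
```

A single `system.time()` call measures one run, so small differences between fast algorithms may reflect noise rather than a real difference in speed.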

The results are displayed in the following table. The running time is the elapsed time in seconds reported by system.time().

knitr::kable(res)
| com_detect_fun | No. of total nodes | No. of total groups | Average node number in each group | Standard deviation of node number | Computer running time for keyword_group function |
|:--|--:|--:|--:|--:|--:|
| group_edge_betweenness | 103 | 36 | 2.86 | 9.17 | 0.50 |
| group_edge_betweenness | 207 | 68 | 3.04 | 12.53 | 2.98 |
| group_edge_betweenness | 326 | 89 | 3.66 | 13.12 | 10.03 |
| group_fast_greedy | 103 | 5 | 20.60 | 8.17 | 0.17 |
| group_fast_greedy | 207 | 5 | 41.40 | 24.36 | 0.18 |
| group_fast_greedy | 326 | 6 | 54.33 | 34.77 | 0.19 |
| group_infomap | 103 | 1 | 103.00 | NA | 0.17 |
| group_infomap | 207 | 4 | 51.75 | 94.83 | 0.22 |
| group_infomap | 326 | 6 | 54.33 | 114.98 | 0.34 |
| group_label_prop | 103 | 1 | 103.00 | NA | 0.16 |
| group_label_prop | 207 | 1 | 207.00 | NA | 0.17 |
| group_label_prop | 326 | 1 | 326.00 | NA | 0.18 |
| group_leading_eigen | 103 | 4 | 25.75 | 9.57 | 0.17 |
| group_leading_eigen | 207 | 5 | 41.40 | 19.19 | 0.18 |
| group_leading_eigen | 326 | 7 | 46.57 | 35.15 | 0.22 |
| group_louvain | 103 | 5 | 20.60 | 12.14 | 0.16 |
| group_louvain | 207 | 8 | 25.88 | 14.11 | 0.17 |
| group_louvain | 326 | 9 | 36.22 | 19.08 | 0.18 |
| group_spinglass | 103 | 5 | 20.60 | 5.13 | 1.66 |
| group_spinglass | 207 | 8 | 25.88 | 13.38 | 4.04 |
| group_spinglass | 326 | 8 | 40.75 | 12.07 | 7.30 |
| group_walktrap | 103 | 103 | 1.00 | 0.00 | 0.16 |
| group_walktrap | 207 | 207 | 1.00 | 0.00 | 0.17 |
| group_walktrap | 326 | 326 | 1.00 | 0.00 | 0.17 |
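
The NA entries in the standard deviation column belong to runs that returned a single group, because `sd()` of one value is NA in R. The following base R sketch reproduces the three group statistics on toy membership vectors; `group_size_stats()` is a hypothetical helper written for this illustration and is not part of akc.

```r
# Hypothetical helper: given one group label per node, return the number
# of groups plus the mean and standard deviation of the group sizes.
group_size_stats <- function(group) {
  sizes <- as.vector(table(group))
  c(group_no = length(sizes), avg = mean(sizes), sd = sd(sizes))
}

# Six nodes in three groups of sizes 3, 2 and 1.
several_groups <- group_size_stats(c(1, 1, 1, 2, 2, 3))
print(several_groups)

# Five nodes that all fall into one group: sd() of a single size is NA.
single_group <- group_size_stats(rep(1, 5))
print(single_group)
```
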

The session information is shown below:

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
#>  [5] evaluate_0.16   highr_0.9       stringi_1.7.8   cachem_1.0.6   
#>  [9] rlang_1.0.4     cli_3.3.0       rstudioapi_0.13 jquerylib_0.1.4
#> [13] bslib_0.4.0     rmarkdown_2.14  tools_4.2.1     stringr_1.4.0  
#> [17] xfun_0.32       yaml_2.3.5      fastmap_1.1.0   compiler_4.2.1 
#> [21] htmltools_0.5.3 knitr_1.39      sass_0.4.2