Introduction to dtreg

2. Create an instance

To write down your data in accordance with a DTR schema, dtreg uses R6 classes. Therefore, you need to create an instance of a specific class.

Fields

For doing that, you first need to know which fields you can use for the selected schema:

dtreg::show_fields(dt$group_comparison())

For example, this schema has the field “label”. If your instance included only a label, it would be:

labelled_inst <- dt$group_comparison(label = "my_test_results")

However, the data you want to write are usually more complex than this. While creating an instance, please pay attention to fields you use and types of input required by these fields (e.g., another schema or a specific type of data, such as numeric). Most analytical schemata in the ePIC DTR have the same fields: “label”, “executes”, “has_input”, and “has_output”. In addition, some have “targets” and “level” fields.

Most common types of input

The most frequently used types of input are strings, numeric, and data frames. We briefly outline how to write them in instance fields.

Strings

Strings are used for labels and comments, and URLs are also presented as strings:

method_url <- dt$software_method(has_support_url = "https://search.r-project.org/R/refmans/stats/html/00Index.html")

A special case is a string used in the field “is_implemented_by”. For the overall report in the data_analysis schema, it is simply a URL string of the code location (e.g., a GitHub repository). For other schemata, it is a string of code implementing a specific test. In most cases, it is one line:

method_line <-
  dt$software_method(is_implemented_by = "stats::wilcox.test(var_1, var_2)")

Sometimes, you want to report more than one line. In this case, please follow the convention:

method_lines <-
  dt$software_method(is_implemented_by = paste("first line of code",
                                               "second line of code",
                                               sep = "\n"))

Numeric

Numeric values are also frequently used:

dimensions <- dt$matrix_size(number_of_rows = 100,
                             number_of_columns = 5)

Data frames and tuples

Some fields, such as “source_table”, require a data frame:

my_dataframe <- data.frame(W = 44.5, p = 2.2e-16)
output_dataframe <- dt$data_item(source_table = my_dataframe)

Please check which elements you include in your data frame and how you assign the columns. When in doubt, look at your column with:

class(my_dataframe$W)

The result should be either a basic R data type or an R factor.

By default, dtreg assigns a data frame a generic label “Table”. If you want to give your data frame a label, you can use a tuple. Please always include first the data frame and then the name as a string in the tuple:

library(sets)
my_tuple <- sets::tuple(my_dataframe, "the Wilcoxon test results")
output_tuple <- dt$data_item(source_table = my_tuple)

These are the most frequently used types of input that you write in a field.

More than one input in a field

Sometimes a few objects should be written in one field. In this case, simply use a list:

var_1 <- dt$component(label = "var_1")
var_2 <- dt$component(label = "var_2")
two_vars <- dt$group_comparison(targets = list(var_1, var_2))

Nested structure

In the example above you can see that a field of a schema might require another schema. This nested structure is important for making the data machine readable. The example above can be also written this way, and you can choose which is more convenient for you:

two_vars <-
  dt$group_comparison(targets = list(dt$component(label = "var_1"),
                                     dt$component(label = "var_2")))

General remarks on writing an instance

Please be attentive when writing the results. In case you misspell a variable, omit a comma separating two fields, forget a closing bracket, or make another seemingly tiny mistake, you will get an error message (something like “object ‘inputt’ not found”). You can experiment with such typos and see which error messages you get.

None of the fields are mandatory: you will not get an error message if you leave any field, or all of them, empty. It makes sense, though, to create a useful JSON-LD file.

4. Example: reporting data analysis with dtreg

In this vignette, we show reporting data analysis results with dtreg. The following code was simplified for illustration purposes:

attach(iris)
virg <- iris[iris$Species == "virginica", ]
vers <- iris[iris$Species == "versicolor", ]
wilc <- stats::wilcox.test(vers$Petal.Length, virg$Petal.Length)
regr <- summary(stats::lm(Petal.Length ~ Petal.Width, data = virg))

Here, we selected virginica and versicolor rows of the Iris dataset as two separate data frames; conducted a Wilcoxon test comparing petal length of these species; and explored the relationship between petal length and petal width in virginica with a simple linear regression (SLR).

Choose parts of analysis to address and schemata to load

First of all, we select schemata based on what analytic methods we report. Let us omit preprocessing, which is rather trivial in this case, and focus on the Wilcoxon test and SLR. For them, as the help page specifies, we need the group_comparison and regression_analysis schemata. Also, for any data analysis, the overarching data_analysis schema is required. Let us load these (see Load a DTR schema above):

dt_wilc <-
  dtreg::load_datatype("https://doi.org/21.T11969/b9335ce2c99ed87735a6")
dt_regr <-
  dtreg::load_datatype("https://doi.org/21.T11969/286991b26f02d58ee490")
dt_all <-
  dtreg::load_datatype("https://doi.org/21.T11969/feeb33ad3e4440682a4d")

Now we can prepare to write the instances.

Prepare data frames with test results

Reporting the data analysis results transparently is important, and there is a trade-off between concise and comprehensive presentation. For the Wilcoxon test, this data frame will suffice:

wilc_result <- data.frame(W = 44.5,
                          p = 2.2e-16,
                          stringsAsFactors = FALSE)
rownames(wilc_result) <- "value"

For the SLR, we have more information from the summary object that is crucial to report:

regr_coeff <- data.frame(regr$coefficients)
regr_model <-
  data.frame(
    F = 5.557,
    numdf = 1,
    dendf = 48,
    p = 0.02254,
    r.squared = 0.1038,
    adj.r.squared = 0.08508,
    stringsAsFactors = FALSE
  )
rownames(regr_model) <- "value"

Please remember that this is a simplistic example, and in real statistical tests you might want to report more details (e.g., the effect size for the Wilcoxon test).

Find out which schemata parts can be reused

Schemata such as the data_item or software_method are parts used in all analytical schemata. When an instance of such a class is converted into JSON-LD format, it does not matter, to which larger analytical schema it belongs. This can be easily checked with the following code:

inst_1 <- dt_wilc$data_item()
json_1 <- dtreg::to_jsonld(inst_1)
inst_2 <- dt_regr$data_item()
json_2 <- dtreg::to_jsonld(inst_2)
identical(json_1, json_2)

In this example, the Wilcoxon test and the SLR have the same input data, software library (“stats” in R), and target variable (“Petal.Length”). Therefore, some parts of the Wilcoxon instance can be reused in the SLR instance. These are (i) the data_item instance describing the Iris dataset; (ii) the software_library instance describing the software (R) and the package (“stats”); and (iii) the component instance indicating that our target variable is “Petal.Length”:

data_iris <- dt_wilc$data_item(
  label = "iris",
  has_characteristic = dt_wilc$matrix_size(number_of_rows = 150,
                                           number_of_columns = 5)
)
software <- dt_wilc$software(label = "R",
                             versioninfo = "4.3.1")
soft_library <- dt_wilc$software_library(
  label = "stats",
  part_of = software,
  versioninfo = "4.3.1",
  has_support_url = "https://search.r-project.org/R/refmans/stats/html/00Index.html"
)
petal_length <- dt_wilc$component(label = "Petal.Length")

Write all instances with the information specified above

Let us create the Wilcoxon instance:

soft_method_wilc <-
  dt_wilc$software_method(label = "stats::wilcoxon",
                          part_of = soft_library,
                          is_implemented_by =
                          "stats::wilcox.test(vers$Petal.Length, virg$Petal.Length)")
output_wilc <- dt_wilc$data_item(source_table = wilc_result)
instance_wilc <- dt_wilc$group_comparison(
  label = "Wilcoxon Petal.Length, virg vs vers",
  executes = soft_method_wilc,
  has_input = data_iris,
  targets = petal_length,
  has_output = output_wilc
)

Then we create the SLR instance:

soft_method_regr <-
  dt_regr$software_method(label = "stats::lm",
                          part_of = soft_library,
                          is_implemented_by =
                          "summary(stats::lm(Petal.Length ~ Petal.Width, data = virg))")
output_regr <-
  dt_regr$data_item(source_table = list(regr_coeff, regr_model))
instance_regr <- dt_regr$regression_analysis(
  label = "SLR Petal.Length vs Petal.Width, virg",
  executes = soft_method_regr,
  has_input = data_iris,
  targets = petal_length,
  has_output = output_regr
)

Write the data_analysis instance and convert into JSON-LD format

Now, we include both instances in the final data_analysis instance:

instance_all <- dt_all$data_analysis(
  label = "my_data_analysis",
  is_implemented_by = "my_github_link",
  has_part = list(instance_wilc, instance_regr)
)

Finally, we convert the data_analysis instance into JSON-LD format and save the file:

instance_all_json <- dtreg::to_jsonld(instance_all)
write(instance_all_json, "instance_all_file.json")

Additional comments

The mrap package is a forthcoming R package based on dtreg that will simplify the process of producing machine-readable results by packing this code into just a couple of lines. This new package will be available in CRAN soon. While the goal of mrap is to streamline data analysis workflows, it’s worth noting that manually structuring your results using dtreg still has its benefits. For example, it gives you maximum control over how you would like to present and structure your results.