Translating complex Presto data types

In this vignette, we demonstrate how complex structural types in Presto can be translated into R types (e.g., vectors, list, and tibbles).

Package setup

library(RPresto)

You can check your RPresto version by running the packageVersion() function. You need version 1.3.9 or later to have a more comprehensive and robust complex types support.

packageVersion("RPresto")
#> [1] '1.4.7'

Overview

Complex types refer to structural types including ARRAY, MAP and ROW. Those data types are basically containers to hold other data types (hence complex).

We summarize the similarities and differences between the complex types below.

Type Is atomic? Has names/keys?
ARRAY Yes No
MAP Yes Yes
ROW No Yes

Atomic here means all elements in the container share the same type (usually primitive types, but can be complex types too). For example, an ARRAY of integer can only hold integer type elements whereas a ROW can have elements of different types (e.g., one element is integer and the other boolean).

Walkthrough preparation

Local Presto server in memory

We assume that the user already have a Presto server with a memory connector set up. If you don’t have such a server set up, refer to the Presto documentation for instructions if you want to follow along.

Presto connection in R

We first create a PrestoConnection which will serve as the bridge between the Presto server and R.

con <- DBI::dbConnect(
  drv = RPresto::Presto(),
  host = "http://localhost",
  port = 8080,
  user = Sys.getenv("USER"),
  catalog = "memory",
  schema = "default"
)

We first issue a simple query to see if the Presto connection is working properly.

DBI::dbGetQuery(con, "SELECT 1+1 AS res")
#> # A tibble: 1 × 1
#>     res
#>   <int>
#> 1     2

Atomic structural types that contain primitive types

Presto type R type
ARRAY unnamed typed vector
MAP named typed vector

ARRAYs

Given the atomic and unnamed nature of ARRAYs, we map them to unnamed typed vectors in R.

Create a table with ARRAYs of primitive types

We first create a table with ARRAYs of all supported primitive Presto data types using the create_primitive_arrays_table() function included in the RPresto package.

RPresto:::create_primitive_arrays_table(
  con, table_name = "presto_primitive_arrays", verbose = FALSE
)

We can check if the table now exists in Presto.

DBI::dbExistsTable(con, "presto_primitive_arrays")
#> [1] TRUE

We can list the fields in the table. They are named after the Presto types they represent.

DBI::dbListFields(con, "presto_primitive_arrays")
#>  [1] "array_boolean"                "array_tinyint"               
#>  [3] "array_smallint"               "array_integer"               
#>  [5] "array_bigint"                 "array_real"                  
#>  [7] "array_double"                 "array_decimal"               
#>  [9] "array_varchar"                "array_char"                  
#> [11] "array_varbinary"              "array_date"                  
#> [13] "array_time"                   "array_time_with_tz"          
#> [15] "array_timestamp"              "array_timestamp_with_tz"     
#> [17] "array_interval_year_to_month" "array_interval_day_to_second"

Translate ARRAYs to R vectors

(
  df.array_of_primitive_types <- dbGetQuery(
    con,
    "SELECT * FROM presto_primitive_arrays",
    bigint = "integer64"
  )
)
#> # A tibble: 1 × 18
#>   array_boolean array_tinyint array_smallint array_integer array_bigint
#>   <list>        <list>        <list>         <list>        <list>      
#> 1 <lgl [3]>     <int [3]>     <int [3]>      <int [3]>     <int64 [3]> 
#> # ℹ 13 more variables: array_real <list>, array_double <list>,
#> #   array_decimal <list>, array_varchar <list>, array_char <list>,
#> #   array_varbinary <list>, array_date <list>, array_time <list>,
#> #   array_time_with_tz <list>, array_timestamp <list>,
#> #   array_timestamp_with_tz <list>, array_interval_year_to_month <list>,
#> #   array_interval_day_to_second <list>

We can verify the R types of each column.

tibble::enframe(purrr::map_chr(df.array_of_primitive_types, ~class(.[[1]])[1]))
#> # A tibble: 18 × 2
#>    name                         value    
#>    <chr>                        <chr>    
#>  1 array_boolean                logical  
#>  2 array_tinyint                integer  
#>  3 array_smallint               integer  
#>  4 array_integer                integer  
#>  5 array_bigint                 integer64
#>  6 array_real                   numeric  
#>  7 array_double                 numeric  
#>  8 array_decimal                character
#>  9 array_varchar                character
#> 10 array_char                   character
#> 11 array_varbinary              list     
#> 12 array_date                   Date     
#> 13 array_time                   hms      
#> 14 array_time_with_tz           hms      
#> 15 array_timestamp              POSIXct  
#> 16 array_timestamp_with_tz      POSIXct  
#> 17 array_interval_year_to_month Duration 
#> 18 array_interval_day_to_second Duration

All vectors are unnamed.

purrr::every(df.array_of_primitive_types, ~is.null(names(.[[1]])))
#> [1] TRUE

We can also call functions such as length() on each of the columns to get the ARRAY cardinality. It shows that all ARRAYs have 3 elements in them.

tibble::enframe(purrr::map_int(df.array_of_primitive_types, ~length(.[[1]])))
#> # A tibble: 18 × 2
#>    name                         value
#>    <chr>                        <int>
#>  1 array_boolean                    3
#>  2 array_tinyint                    3
#>  3 array_smallint                   3
#>  4 array_integer                    3
#>  5 array_bigint                     3
#>  6 array_real                       3
#>  7 array_double                     3
#>  8 array_decimal                    3
#>  9 array_varchar                    3
#> 10 array_char                       3
#> 11 array_varbinary                  3
#> 12 array_date                       3
#> 13 array_time                       3
#> 14 array_time_with_tz               3
#> 15 array_timestamp                  3
#> 16 array_timestamp_with_tz          3
#> 17 array_interval_year_to_month     3
#> 18 array_interval_day_to_second     3

MAPs

A MAP in Presto can be thought as a combination of two same-length ARRAYs. The first ARRAY contains the keys of the MAP and the second contains the values of the MAP. In fact, that’s exactly how MAP literals are created in Presto (e.g., MAP(ARRAY[1, 2], ARRAY['a', 'b']) creates a 2-element MAP).

Following the logic, we translate MAPs to named typed vectors in R.

Create a table with MAPS of primitive types

We first create a table with MAPSs of all supported primitive Presto data types using the create_primitive_maps_table() function included in the RPresto package.

RPresto:::create_primitive_maps_table(
  con, table_name = "presto_primitive_maps", verbose = FALSE
)

We can check if the table now exists in Presto.

DBI::dbExistsTable(con, "presto_primitive_maps")
#> [1] TRUE

We can list the fields in the table. They are named after the Presto types they represent.

DBI::dbListFields(con, "presto_primitive_maps")
#>  [1] "map_boolean"                "map_tinyint"               
#>  [3] "map_smallint"               "map_integer"               
#>  [5] "map_bigint"                 "map_real"                  
#>  [7] "map_double"                 "map_decimal"               
#>  [9] "map_varchar"                "map_char"                  
#> [11] "map_varbinary"              "map_date"                  
#> [13] "map_time"                   "map_time_with_tz"          
#> [15] "map_timestamp"              "map_timestamp_with_tz"     
#> [17] "map_interval_year_to_month" "map_interval_day_to_second"

Translate MAPs to R vectors

(
  df.map_of_primitive_types <- dbGetQuery(
    con,
    "SELECT * FROM presto_primitive_maps",
    bigint = "integer64"
  )
)
#> # A tibble: 1 × 18
#>   map_boolean map_tinyint map_smallint map_integer map_bigint  map_real 
#>   <list>      <list>      <list>       <list>      <list>      <list>   
#> 1 <lgl [3]>   <int [3]>   <int [3]>    <int [3]>   <int64 [3]> <dbl [3]>
#> # ℹ 12 more variables: map_double <list>, map_decimal <list>,
#> #   map_varchar <list>, map_char <list>, map_varbinary <list>, map_date <list>,
#> #   map_time <list>, map_time_with_tz <list>, map_timestamp <list>,
#> #   map_timestamp_with_tz <list>, map_interval_year_to_month <list>,
#> #   map_interval_day_to_second <list>

We can verify the R types of each column.

tibble::enframe(purrr::map_chr(df.map_of_primitive_types, ~class(.[[1]])[1]))
#> # A tibble: 18 × 2
#>    name                       value    
#>    <chr>                      <chr>    
#>  1 map_boolean                logical  
#>  2 map_tinyint                integer  
#>  3 map_smallint               integer  
#>  4 map_integer                integer  
#>  5 map_bigint                 integer64
#>  6 map_real                   numeric  
#>  7 map_double                 numeric  
#>  8 map_decimal                character
#>  9 map_varchar                character
#> 10 map_char                   character
#> 11 map_varbinary              list     
#> 12 map_date                   Date     
#> 13 map_time                   hms      
#> 14 map_time_with_tz           hms      
#> 15 map_timestamp              POSIXct  
#> 16 map_timestamp_with_tz      POSIXct  
#> 17 map_interval_year_to_month Duration 
#> 18 map_interval_day_to_second Duration

All vectors are named.

purrr::none(df.map_of_primitive_types, ~is.null(names(.[[1]])))
#> [1] TRUE

Repeated ARRAYs and MAPs

It’s possible to have repeated ARRAYs and MAPs in Presto in the form of ARRAYs of ARRAYs and ARRAYs of MAPs.

Repeated Presto type R type
ARRAY of ARRAY not supported
ARRAY of MAP unnamed list of named typed vectors

We are not supporting nested ARRAYs at the moment although it’s technically possible in Presto.

For ARRAYs of MAPs, we translate the ARRAY container into an unnamed list and each of the MAP element into a named typed vector.

We first create an array-of-maps table by using the create_array_of_maps_table() function.

RPresto:::create_array_of_maps_table(
  con, table_name = "presto_array_of_maps", verbose = FALSE
)

We can check if the table now exists in Presto.

DBI::dbExistsTable(con, "presto_array_of_maps")
#> [1] TRUE

We can list the fields in the table.

DBI::dbListFields(con, "presto_array_of_maps")
#>  [1] "array_map_boolean"                "array_map_tinyint"               
#>  [3] "array_map_smallint"               "array_map_integer"               
#>  [5] "array_map_bigint"                 "array_map_real"                  
#>  [7] "array_map_double"                 "array_map_decimal"               
#>  [9] "array_map_varchar"                "array_map_char"                  
#> [11] "array_map_varbinary"              "array_map_date"                  
#> [13] "array_map_time"                   "array_map_time_with_tz"          
#> [15] "array_map_timestamp"              "array_map_timestamp_with_tz"     
#> [17] "array_map_interval_year_to_month" "array_map_interval_day_to_second"

Let’s import all the data into R.

(
  df.array_of_maps <- dbGetQuery(
    con,
    "SELECT * FROM presto_array_of_maps",
    bigint = "integer64"
  )
)
#> # A tibble: 1 × 18
#>   array_map_boolean array_map_tinyint array_map_smallint array_map_integer
#>   <list>            <list>            <list>             <list>           
#> 1 <list [2]>        <list [2]>        <list [2]>         <list [2]>       
#> # ℹ 14 more variables: array_map_bigint <list>, array_map_real <list>,
#> #   array_map_double <list>, array_map_decimal <list>,
#> #   array_map_varchar <list>, array_map_char <list>,
#> #   array_map_varbinary <list>, array_map_date <list>, array_map_time <list>,
#> #   array_map_time_with_tz <list>, array_map_timestamp <list>,
#> #   array_map_timestamp_with_tz <list>,
#> #   array_map_interval_year_to_month <list>, …

We need to pry open the wrapping unnamed list to reveal the types of the vectors underneath.

tibble::enframe(purrr::map_chr(df.array_of_maps, ~class(.[[1]][[1]])[1]))
#> # A tibble: 18 × 2
#>    name                             value    
#>    <chr>                            <chr>    
#>  1 array_map_boolean                logical  
#>  2 array_map_tinyint                integer  
#>  3 array_map_smallint               integer  
#>  4 array_map_integer                integer  
#>  5 array_map_bigint                 integer64
#>  6 array_map_real                   numeric  
#>  7 array_map_double                 numeric  
#>  8 array_map_decimal                character
#>  9 array_map_varchar                character
#> 10 array_map_char                   character
#> 11 array_map_varbinary              list     
#> 12 array_map_date                   Date     
#> 13 array_map_time                   hms      
#> 14 array_map_time_with_tz           hms      
#> 15 array_map_timestamp              POSIXct  
#> 16 array_map_timestamp_with_tz      POSIXct  
#> 17 array_map_interval_year_to_month Duration 
#> 18 array_map_interval_day_to_second Duration

ROW type

The easiest way to think about the ROW type in Presto is to think of it literally as a row of a table. Just as a table can have multiple columns of different data types, a ROW can have multiple elements of different types. And just like a table having a name for each column, every element of a ROW has a name associated with the value.

Depending on whether the ROW type is repeated (i.e., wrapped in an ARRAY), the translation into R is different.

Presto type R type
Single ROW named list
Repeated ROWs tibble

Single ROW translation

To demonstrate how ROW types are translated into R types, we first create a table using an auxiliary create_primitive_rows_table() function included in the package. The resulting table has only 1 column named row_primitive_types which is a ROW that includes 18 sub-columns representing all supported primitive types.

RPresto:::create_primitive_rows_table(
  con, table_name = "presto_primitive_rows", verbose = FALSE
)

We can check if the table now exists in Presto.

DBI::dbExistsTable(con, "presto_primitive_rows")
#> [1] TRUE

We can list the fields in the table.

DBI::dbListFields(con, "presto_primitive_rows")
#> [1] "row_primitive_types"

We can then retrieve all the data from the table.

(
  df.row_of_primitive <- dbGetQuery(
    con,
    "SELECT row_primitive_types FROM presto_primitive_rows",
    bigint = "integer64"
  )
)
#> # A tibble: 3 × 1
#>   row_primitive_types
#>   <list>             
#> 1 <named list [18]>  
#> 2 <named list [18]>  
#> 3 <named list [18]>

We can check the R types of each element in the named list.

tibble::enframe(
  purrr::map_chr(df.row_of_primitive$row_primitive_types[[1]], ~class(.)[1])
)
#> # A tibble: 18 × 2
#>    name                   value    
#>    <chr>                  <chr>    
#>  1 boolean                logical  
#>  2 tinyint                integer  
#>  3 smallint               integer  
#>  4 integer                integer  
#>  5 bigint                 integer64
#>  6 real                   numeric  
#>  7 double                 numeric  
#>  8 decimal                character
#>  9 varchar                character
#> 10 char                   character
#> 11 varbinary              list     
#> 12 date                   Date     
#> 13 time                   hms      
#> 14 time_with_tz           hms      
#> 15 timestamp              POSIXct  
#> 16 timestamp_with_tz      POSIXct  
#> 17 interval_year_to_month Duration 
#> 18 interval_day_to_second Duration

Repeated ROWs translation

To construct a repeated ROW column, we use the auxiliary create_array_of_rows_table() function.

RPresto:::create_array_of_rows_table(
  con, table_name = "presto_array_of_rows", verbose = FALSE
)

We can check if the table now exists in Presto and the field name.

DBI::dbExistsTable(con, "presto_array_of_rows")
#> [1] TRUE
DBI::dbListFields(con, "presto_array_of_rows")
#> [1] "array_of_rows"

We can import the whole data into R.

(
  df.array_of_rows <- dbGetQuery(
    con,
    "SELECT array_of_rows FROM presto_array_of_rows",
    bigint = "integer64"
  )
)
#> # A tibble: 3 × 1
#>   array_of_rows    
#>   <list>           
#> 1 <tibble [2 × 18]>
#> 2 <tibble [2 × 18]>
#> 3 <tibble [2 × 18]>

We can verify the tibble’s column types.

tibble::enframe(
  purrr::map_chr(df.array_of_rows$array_of_rows[[1]], ~class(.)[1])
)
#> # A tibble: 18 × 2
#>    name                   value    
#>    <chr>                  <chr>    
#>  1 boolean                logical  
#>  2 tinyint                integer  
#>  3 smallint               integer  
#>  4 integer                integer  
#>  5 bigint                 integer64
#>  6 real                   numeric  
#>  7 double                 numeric  
#>  8 decimal                character
#>  9 varchar                character
#> 10 char                   character
#> 11 varbinary              list     
#> 12 date                   Date     
#> 13 time                   hms      
#> 14 time_with_tz           hms      
#> 15 timestamp              POSIXct  
#> 16 timestamp_with_tz      POSIXct  
#> 17 interval_year_to_month Duration 
#> 18 interval_day_to_second Duration

Walkthrough wrap-up

DBI::dbDisconnect(con)
#> [1] TRUE