Column types

Hadley Wickham

2016-08-03


Currently, readr automatically recognises the following types of columns:

To recognise these columns, readr inspects the first 1000 rows of your dataset. This is not guaranteed to be perfect, but it’s fast and a reasonable heuristic. If you get a lot of parsing failures, you’ll need to re-read the file, either increasing guess_max or overriding the default choices as described below.
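As a minimal sketch of the guess_max option, the snippet below builds a small inline CSV in which the only non-numeric value sits in the last row. The data and object names are invented for illustration. Which rows are sampled for a small guess_max may differ between readr versions, so no output is shown for that case.

```r
library(readr)

# Hypothetical data: five numeric-looking rows followed by one text value
csv <- paste0("x\n", paste(c(1:5, "abc"), collapse = "\n"), "\n")

# Guessing from only a couple of rows may settle on a numeric type,
# in which case "abc" is reported as a parsing failure
narrow <- suppressWarnings(read_csv(csv, guess_max = 2))

# Guessing from more rows than the file contains sees "abc",
# so the column is read as character
wide <- read_csv(csv, guess_max = 100)
class(wide$x)
```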

You can also manually specify other column types:

Use the col_types argument to override the default choices. There are two ways to use it:
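A short sketch of both forms, assuming they are a compact string with one letter per column (here i, c and d for integer, character and double) and a cols() specification with one col_xyz() call per column. The inline CSV and object names are made up for the example.

```r
library(readr)

# Hypothetical three-column file supplied as literal data
csv <- "x,y,z\n1,a,2.5\n2,b,3.5\n"

# 1. Compact string: one character per column, in order
by_string <- read_csv(csv, col_types = "icd")

# 2. cols() specification: columns are matched by name
by_spec <- read_csv(csv, col_types = cols(
  x = col_integer(),
  y = col_character(),
  z = col_double()
))

# Both calls should yield the same column classes
identical(sapply(by_string, class), sapply(by_spec, class))
```

The string form is quicker to type, while the cols() form is needed for column types that take extra arguments, such as a date format.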

When reading files interactively, the first 20 columns of the column specification used are printed. Use options(readr.num_columns = n) to change the number of columns printed; setting the value to 0 disables printing.

readr attaches the spec used for the file to the output. It can be retrieved by calling spec() on the object.

data <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )
data
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#> 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
#> 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> # ... with 22 more rows

# Every table returned has a spec attribute
s <- spec(data)
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

# Alternatively you can use a spec function instead, which will only read the
# first 1000 rows (user configurable with guess_max)
s <- spec_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

# Automatically set the default to the most common type
cols_condense(s)
#> cols(
#>   .default = col_integer(),
#>   mpg = col_double(),
#>   disp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double()
#> )

# If the spec has a default of skip, it is printed using cols_only()
s$default <- col_skip()
s
#> cols_only(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

# Otherwise set the default to the proper type
s$default <- col_character()
s
#> cols(
#>   .default = col_character(),
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

# The print method takes an n parameter to limit the number of columns printed
print(s, n = 5)
#> cols(
#>   .default = col_integer(),
#>   mpg = col_double(),
#>   disp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double()
#> )

# When reading, this is set to 20 by default; set
# options("readr.num_columns" = x) to change it
options("readr.num_columns" = 5)
data <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   .default = col_integer(),
#>   mpg = col_double(),
#>   disp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double()
#> )
#> See spec(...) for full column specifications.

# Setting it to 0 disables printing
options("readr.num_columns" = 0)
data <- read_csv(readr_example("mtcars.csv"))

Spec types

Column parsers

As well as specifying how to parse a column from a file on disk, each of the col_xyz() functions has an equivalent parse_xyz() that parses a character vector. These are useful for testing and examples, and for rapidly experimenting to figure out how to parse a vector given a few examples.

Base types

parse_logical(), parse_integer(), parse_double(), and parse_character() are straightforward parsers that produce the corresponding atomic vector.

Make sure to read vignette("locales") to learn how to deal with doubles.
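For illustration, a minimal sketch of the four base parsers applied to small, made-up character vectors (the variable names are arbitrary):

```r
library(readr)

lgl <- parse_logical(c("TRUE", "FALSE", "NA"))
int <- parse_integer(c("1", "2", "3"))
dbl <- parse_double(c("1.5", "2", "-0.25"))
chr <- parse_character(c("a", "b"))

# Each result is the corresponding atomic vector type
c(typeof(lgl), typeof(int), typeof(dbl), typeof(chr))
```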

Numbers

parse_integer() and parse_double() are strict: the input string must be a single number with no leading or trailing characters. parse_number() is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:

parse_number(c("0%", "10%", "150%"))
#> [1]   0  10 150
parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50   12.45

Note that guess_parser() will only guess that a string is a number if it has no leading or trailing characters (after trimming whitespace); otherwise it’s too prone to false positives. That means you’ll typically need to explicitly supply the column type for number columns.

guess_parser("$1,234")
#> [1] "character"
guess_parser("1,234")
#> [1] "number"
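A sketch of supplying the type explicitly when reading, using col_number() on a hypothetical price column. The inline CSV and its column names are invented for the example.

```r
library(readr)

# Hypothetical file: prices carry a currency symbol and a grouping mark,
# so the first value is quoted to protect its comma
csv <- "item,price\nwidget,\"$1,234.50\"\ngadget,$12.45\n"

prices <- read_csv(csv, col_types = cols(
  item = col_character(),
  price = col_number()
))

# price is now a numeric column rather than character
prices$price
```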

Date times

readr supports three types of date/time data:

readr will guess date time fields if they’re in ISO8601 format:

parse_datetime("2010-10-01 21:45")
#> [1] "2010-10-01 21:45:00 UTC"
parse_date("2010-10-01")
#> [1] "2010-10-01"

Otherwise, you’ll need to specify the format yourself:

parse_datetime("1 January, 2010", "%d %B, %Y")
#> [1] "2010-01-01 UTC"
parse_datetime("02/02/15", "%m/%d/%y")
#> [1] "2015-02-02 UTC"
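The same format strings can be passed to parse_date(), and parse_time() handles times of day. A small sketch with invented inputs:

```r
library(readr)

# A date written as month/day/two-digit-year needs an explicit format
d <- parse_date("01/02/15", "%m/%d/%y")
format(d)

# A time of day in HH:MM:SS form is parsed without a format
tm <- parse_time("20:10:01")
as.numeric(tm)  # seconds since midnight
```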

Factors

When reading a column that has a known set of values, you can read directly into a factor.

parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c

readr will never turn a character vector into a factor unless you explicitly ask for it.
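When reading a file, a factor column is requested the same way through col_factor(). A minimal sketch, with a made-up inline CSV and level set:

```r
library(readr)

# Hypothetical file with a column drawn from a known set of values
csv <- "id,size\n1,small\n2,large\n3,small\n"

sizes <- read_csv(csv, col_types = cols(
  id = col_integer(),
  size = col_factor(levels = c("small", "medium", "large"))
))

# The declared levels are kept, including ones absent from the data
levels(sizes$size)
```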