merge {data.table} | R Documentation |
Fast merge of two data.table
s.
This merge
method for data.table
behaves very similarly to that
of data.frame
s with one major exception: By default,
the columns used to merge the data.table
s are the shared key columns
rather than the shared columns with the same names. Set the by
, or by.x
,
by.y
arguments explicitly to override this default.
## S3 method for class 'data.table' merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), allow.cartesian=getOption("datatable.allow.cartesian"), # default FALSE ...)
x, y |
|
by |
A vector of shared column names in |
by.x, by.y |
Vectors of column names in |
all |
logical; |
all.x |
logical; if |
all.y |
logical; analogous to |
sort |
logical. If |
suffixes |
A |
allow.cartesian |
See |
... |
Not used at this time. |
merge
is a generic function in base R. It dispatches to either the
merge.data.frame
method or merge.data.table
method depending on the class of its first argument.
In versions < v1.9.6
, if the specified columns in by
was not the key (or head of the key) of x
or y
, then a copy
is first rekeyed prior to performing the merge. This was less performant and memory inefficient.
In version v1.9.4
secondary keys was implemented. In v1.9.6
, the concept of secondary keys has been
extended to merge
. No deep copies are made anymore and therefore very performant and memory efficient. Also there is better control for providing the columns to merge on with the help of newly implemented by.x
and by.y
arguments.
For a more data.table
-centric way of merging two data.table
s, see [.data.table
; e.g., x[y, ...]
. See FAQ 1.12 for a detailed comparison of merge
and x[y, ...]
.
Merges on numeric columns: Columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by defalult, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. For large numbers (integers > 2^31), we recommend using bit64::integer64
. Have a look at setNumericRounding
to learn more.
A new data.table
based on the merged data table
s, sorted by the
columns set (or inferred for) the by
argument.
data.table
, [.data.table
,
merge.data.frame
(dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A")) (dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A")) merge(dt1, dt2) merge(dt1, dt2, all = TRUE) (dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A")) (dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A")) merge(dt1, dt2, allow.cartesian=TRUE) (dt1 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(1:3, 2)], X = 1:6, key = "A,B")) (dt2 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(2:4, 2)], Y = 6:1, key = "A,B")) merge(dt1, dt2) merge(dt1, dt2, by="B", allow.cartesian=TRUE) # test it more: d1 <- data.table(a=rep(1:2,each=3), b=1:6, key="a,b") d2 <- data.table(a=0:1, bb=10:11, key="a") d3 <- data.table(a=0:1, key="a") d4 <- data.table(a=0:1, b=0:1, key="a,b") merge(d1, d2) merge(d2, d1) merge(d1, d2, all=TRUE) merge(d2, d1, all=TRUE) merge(d3, d1) merge(d1, d3) merge(d1, d3, all=TRUE) merge(d3, d1, all=TRUE) merge(d1, d4) merge(d1, d4, by="a", suffixes=c(".d1", ".d4")) merge(d4, d1) merge(d1, d4, all=TRUE) merge(d4, d1, all=TRUE) # new feature, no need to set keys anymore set.seed(1L) d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6) d2 <- data.table(a=2:0, z=10:12) merge(d1, d2, by="a") merge(d1, d2, by="a", all=TRUE) # new feature, using by.x and by.y arguments setnames(d2, "a", "b") merge(d1, d2, by.x="a", by.y="b") merge(d1, d2, by.x="a", by.y="b", all=TRUE) merge(d2, d1, by.x="b", by.y="a")