make.input.format {rmr2}    R Documentation

Create combinations of settings for flexible IO

Description

Create combinations of IO settings either from named formats or from a combination of a Java class, a mode and an R function

Usage

make.input.format(format = "native", mode = c("binary", "text"), streaming.format = NULL, backend.parameters = NULL, ...)
make.output.format(format = "native", mode = c("binary", "text"), streaming.format = NULL, backend.parameters = NULL, ...)

Arguments

format

Either a string describing a predefined combination of IO settings (possibilities include: "text", "json", "csv", "native", "sequence.typedbytes", "hbase", "pig.hive" and, for input only, "avro") or a function. For an input format, this function accepts a connection and a number of records and returns a key-value pair (see keyval). For an output format, this function accepts a key-value pair and a connection and writes the former to the latter.
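For illustration, a minimal sketch of a custom text input format follows. The reader name is hypothetical, and returning NULL at end of input is an assumed convention, not something this page documents:

library(rmr2)

# Hypothetical reader: gets a connection and a suggested record count,
# returns a key-value pair, or NULL when the input is exhausted
# (assumed end-of-input convention).
my.reader <- function(con, nrecs) {
  lines <- readLines(con, n = nrecs)
  if (length(lines) == 0) NULL else keyval(NULL, lines)
}

my.input.format <- make.input.format(format = my.reader, mode = "text")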

mode

Either "text" or "binary"; tells R what type of connection to use when opening the IO connections.

streaming.format

Class to pass to hadoop streaming as inputformat or outputformat option. This class is the first in the input chain to perform its duties on the input side and the last on the output side. Right now this option is not honored in local mode.

backend.parameters

Sometimes the class passed to hadoop streaming (see the streaming.format argument) needs additional information to be properly configured. This information can be supplied on the streaming command line, typically with the -D option, and is specified here as a named list: the names are option names, typically "D", and the values are "property=value" pairs given as length-one character vectors, as in backend.parameters = list(D = "property1=value1", D = "property2=value2").
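As a hedged sketch (the Hadoop property below is only illustrative):

fmt <- make.input.format(
  format = "sequence.typedbytes",
  backend.parameters = list(D = "mapreduce.task.timeout=600000"))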

...

Additional arguments to the format function, which depend on the format being defined. All existing input formats accept a read.size (for binary) or nrow (for text) argument representing the number of bytes or rows to parse for each map call. These settings are used as suggestions and cannot be honored exactly; the correctness of a program should not depend on their values. Custom formats created by users do not need to accept these arguments, but they need to provide some other mechanism to read from the input connection in reasonably sized chunks. The format-specific arguments follow:

csv

Additional arguments detail the specifics of the CSV dialect to use and are the same as for read.table and write.table for input and output respectively, with the exception of header, file, x, nrows, col.names and row.names; the last two are allowed for input only.
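For example, a tab-delimited input with explicit column names (the names are illustrative; read.table-style arguments pass through):

csv.fmt <- make.input.format("csv", sep = "\t",
                             col.names = c("id", "name", "score"))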

json

Only for the input format, one can specify a key.class and a value.class to guide the mapping from the JSON data model to R's own, more flexible one. An attempt will be made to cast the data to the specified classes.

native, sequence.typedbytes

write.size says how many values to map to a single typedbytes key-value pair when the key is NULL.
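As a hedged sketch (the value is arbitrary):

out.fmt <- make.output.format("native", write.size = 1000)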

hbase

The name of the table to read from is provided as the input argument to mapreduce. Additional arguments, illustrated in the sketch after this list, are:

family.columns

a named list where the names are family names and the elements are lists of column names within each family;

key.deserialize and cell.deserialize

control the deserialization of keys and cells respectively and can take a string value or a function (explained below); allowed values are "raw", which means cells are text; "typedbytes", which is a serialization format shared with other elements of the Hadoop system; "native", which is the native R format; or a function that takes a list of raw vectors and returns a list or vector of deserialized objects. In the case of cell.deserialize the function should take two additional arguments for the names of the family and column being deserialized.

dense

controls whether the data read from HBase is returned as a 4-column data frame (key, family, column and cell) or as a data frame with one column per selected column, plus one for the key;

atomic

controls whether the data frame columns are atomic or returned "as is", see I.

start.row and stop.row

limit input to a range of keys between the two supplied arguments, each optional; remember the order is based on the serialized representation of the keys

row.filter

specifies a regular expression used to filter the input table; remember that the regular expression is matched against the serialized representation of the keys
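Putting the above together, a hedged sketch of an hbase input format; the table, family and column names are hypothetical:

hb.fmt <- make.input.format(
  "hbase",
  family.columns = list(info = list("name", "age")),
  key.deserialize = "raw",
  cell.deserialize = "raw",
  dense = TRUE)

## the table name is then given as the input to mapreduce:
## mapreduce(input = "my_table", input.format = hb.fmt, ...)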

avro

(input only) It has one mandatory additional argument, schema.file, which should provide the URL of a file containing an appropriate Avro schema; it can be the same as the file to be read. The user can specify the protocol, for instance file: or hdfs:, as part of the URL, with file: being the default.
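A hedged sketch (the path is hypothetical):

avro.fmt <- make.input.format("avro",
                              schema.file = "hdfs:/user/me/schema.avsc")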

Details

The goal of these functions is to encapsulate some of the complexity of the IO settings, providing meaningful defaults and predefined combinations. If you don't want to deal with the full complexity of defining custom IO formats, there are prepackaged combinations.

text

is free text, useful mostly on the input side for NLP-type applications;

json

is one or two tab-separated, single-line JSON objects per record;

csv

is the CSV format, configurable through additional arguments;

native

uses the internal R serialization, offers the highest level of compatibility with R data types and is the default;

sequence.typedbytes

is a sequence file (in the Hadoop sense) where key and value are of type typedbytes, a simple serialization format used in connection with streaming for compatibility with other Hadoop subsystems. Typedbytes is documented at https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/typedbytes/package-summary.html.

hbase

allows reading from (but not yet writing to) an HBase table. This format should still be considered experimental. Hadoop must already be configured to run streaming jobs on HBase tables; see https://wiki.apache.org/hadoop/Hbase/MapReduce.

pig.hive

is a variant of CSV used to transfer data to and from Hive or Pig when they use their default format `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'`.

avro

(input only) is the format defined by the Apache Avro project.

If you want to implement custom formats, the input processing is the composition of a Java class and an R function; the same is true on the output side, but in reverse order. Both can be specified as arguments to these functions.
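For the R side of a custom output format, a minimal sketch follows. The writer name is hypothetical; it assumes values() extracts the values from a key-value pair (see keyval):

my.writer <- function(kv, con) {
  # write each value on its own line; keys are ignored in this sketch
  writeLines(as.character(values(kv)), con)
}

my.output.format <- make.output.format(format = my.writer, mode = "text")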

Value

Returns a list of IO specifications, to be passed as input.format and output.format to mapreduce, and as format to from.dfs (input) and to.dfs (output).

Examples

make.input.format("csv", sep = ",")
