2022-04-24
The idea of the mlapi package is to provide a guideline on how to implement interfaces of machine learning models in order to have a unified, consistent workflow. The API design is mainly borrowed from the very successful Python scikit-learn package. At the moment the scope is limited to the following base classes:

mlapiEstimation / mlapiEstimationOnline - models which implement supervised learning (regression or classification)

mlapiTransformation / mlapiTransformationOnline - models which learn transformations of the data. For example, a model can learn TF-IDF on one matrix and then apply it to another holdout matrix.

mlapiDecomposition / mlapiDecompositionOnline - models which decompose an input matrix into two matrices (usually of low rank). A good example is matrix factorization, where an input matrix \(X\) is decomposed into two matrices \(P\) and \(Q\) so that \(X \approx P Q\).

All the base classes above suggest that the developer implement a set of methods and expose a set of members. The developer should provide a realization of a class which inherits from the corresponding base class above.
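To make the shapes concrete (a standard property of low-rank factorization, not anything specific to mlapi): if \(X\) has \(n\) rows and \(m\) columns and the chosen rank is \(k\), then

\[ X_{n \times m} \approx P_{n \times k} \, Q_{k \times m}, \qquad k \ll \min(n, m), \]

so fit_transform() returns the \(n \times k\) matrix \(P\) for the training data, while the learned \(k \times m\) matrix \(Q\) stays inside the model.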
There are several agreements which help to maintain a consistent workflow.

mlapi defines models to be mutable and internally implemented as R6 classes. A model is created by calling the class constructor with its hyperparameters, for example:

model = SomeModel$new(param_1 = 1, param_2 = 10)
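Because the model is a mutable R6 object, fitting modifies it in place and nothing has to be re-assigned. A minimal usage sketch continuing the line above (SomeModel, x, y and x_new are placeholder names, not part of mlapi):

model$fit(x, y)                     # updates the model object in place
predictions = model$predict(x_new)  # the fitted state lives inside `model`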
Depending on the base class, a model is expected to implement the following methods:

fit - mlapiEstimation
fit_transform - mlapiTransformation, mlapiDecomposition
partial_fit - mlapiEstimationOnline, mlapiTransformationOnline, mlapiDecompositionOnline
predict - mlapiEstimation, mlapiEstimationOnline
transform - mlapiTransformation, mlapiTransformationOnline, mlapiDecomposition, mlapiDecompositionOnline
After fitting an mlapiDecomposition / mlapiDecompositionOnline model, the field private$components_ should be initialized (mind the underscore at the end!). It should contain the matrix \(Q\) (as per \(X \approx P Q\)).

Models are expected to accept input as dense matrices from the base package and sparse matrices from the Matrix package.

This allows us to create concise pipelines which are easy to train and apply to new data (details in the next section):
# transformer:
# the scaler simply divides each column by its standard deviation
scaler = Scaler$new()

# decomposition:
# fits truncated SVD: X = U * S * V
# or, rephrasing, X = P * Q where P = U * sqrt(S); Q = sqrt(S) * V
# as a result trunc_svd$fit_transform(train) returns the matrix P and learns the matrix Q (stored inside the model)
# when trunc_svd$transform(test) is called, the model uses the matrix Q in order to find the matrix P for the `test` data
trunc_svd = SVD$new(rank = 16)

# estimator:
# fits L1/L2-regularized logistic regression
logreg = LogisticRegression$new(L1 = 0.1, L2 = 10)

train %>%
  fit_transform(scaler) %>%
  fit_transform(trunc_svd) %>%
  fit(logreg)
Now all models are fitted.
predictions = test %>%
  transform(scaler) %>%
  transform(trunc_svd) %>%
  predict(logreg)
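The Scaler, SVD and LogisticRegression classes in the pipeline above are illustrative placeholders rather than part of mlapi itself. As a sketch of how the mlapiTransformation interface from the method list could be implemented, a minimal Scaler might look roughly like this (it assumes the same set_internal_matrix_formats() / check_convert_input() helpers of the base classes that the examples below use):

Scaler = R6::R6Class(
  classname = "Scaler",
  inherit = mlapi::mlapiTransformation,
  public = list(
    initialize = function() {
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    fit_transform = function(x, ...) {
      x = super$check_convert_input(x)
      # learn per-column standard deviations and keep them inside the model
      private$std_dev = apply(x, 2, sd)
      sweep(x, 2, private$std_dev, "/")
    },
    transform = function(x, ...) {
      # reuse the standard deviations learned during fit_transform()
      stopifnot(ncol(x) == length(private$std_dev))
      sweep(x, 2, private$std_dev, "/")
    }
  ),
  private = list(
    std_dev = NULL
  )
)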
As an example of implementing the mlapiEstimation interface, here is a simple linear model:

SimpleLinearModel = R6::R6Class(
  classname = "mlapiSimpleLinearModel",
  inherit = mlapi::mlapiEstimation,
  public = list(
    initialize = function(tol = 1e-7) {
      private$tol = tol
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    fit = function(x, y, ...) {
      x = super$check_convert_input(x)
      stopifnot(is.vector(y))
      stopifnot(is.numeric(y))
      stopifnot(nrow(x) == length(y))
      private$n_features = ncol(x)
      private$coefficients = .lm.fit(x, y, tol = private$tol)[["coefficients"]]
    },
    predict = function(x) {
      stopifnot(ncol(x) == private$n_features)
      x %*% matrix(private$coefficients, ncol = 1)
    }
  ),
  private = list(
    tol = NULL,
    coefficients = NULL,
    n_features = NULL
  )
)
set.seed(1)
model = SimpleLinearModel$new()
x = matrix(sample(100 * 10, replace = TRUE), ncol = 10)
y = sample(c(0, 1), 100, replace = TRUE)
model$fit(as.data.frame(x), y)
res1 = model$predict(x)

# check pipe-compatible S3 interface
res2 = predict(x, model)
identical(res1, res2)
## [1] TRUE
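Because predict() here dispatches on the data and takes the model as its second argument (as the check above shows), the same call drops naturally into a magrittr pipeline. A small usage sketch, assuming magrittr is attached:

library(magrittr)
res3 = x %>% predict(model)
identical(res1, res3)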
A decomposition model looks very similar. Note how fit_transform() initializes private$components_ with the matrix \(Q\):

TruncatedSVD = R6::R6Class(
  classname = "TruncatedSVD",
  inherit = mlapi::mlapiDecomposition,
  public = list(
    initialize = function(rank = 10) {
      private$rank = rank
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    fit_transform = function(x, ...) {
      x = super$check_convert_input(x)
      private$n_features = ncol(x)
      svd_fit = svd(x, nu = private$rank, nv = private$rank, ...)
      sing_values = svd_fit$d[seq_len(private$rank)]
      result = svd_fit$u %*% diag(x = sqrt(sing_values))
      private$components_ = t(svd_fit$v %*% diag(x = sqrt(sing_values)))
      rm(svd_fit)
      rownames(result) = rownames(x)
      colnames(private$components_) = colnames(x)
      private$fitted = TRUE
      invisible(result)
    },
    transform = function(x, ...) {
      if (private$fitted) {
        stopifnot(ncol(x) == ncol(private$components_))
        lhs = tcrossprod(private$components_)
        rhs = as.matrix(tcrossprod(private$components_, x))
        t(solve(lhs, rhs))
      } else {
        stop("Fit the model first with model$fit_transform()!")
      }
    }
  ),
  private = list(
    rank = NULL,
    n_features = NULL,
    fitted = NULL
  )
)
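For reference, the transform() method above recovers \(P\) for new data by solving a least-squares problem with the learned \(Q\) held fixed:

\[ P = \arg\min_{P} \lVert X - P Q \rVert_F^2 = X Q^{\top} (Q Q^{\top})^{-1}, \]

which is exactly what solve(lhs, rhs) computes, with lhs = \(Q Q^{\top}\) and rhs = \(Q X^{\top}\), transposed back at the end.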
set.seed(1)
model = TruncatedSVD$new(2)
x = matrix(sample(100 * 10, replace = TRUE), ncol = 10)
x_trunc = model$fit_transform(x)
dim(x_trunc)
## [1] 100   2

x_trunc_2 = model$transform(x)
sum(x_trunc_2 - x_trunc)
## [1] -9.428555e-12

# check pipe-compatible S3 interface
x_trunc_2_s3 = transform(x, model)
identical(x_trunc_2, x_trunc_2_s3)
## [1] TRUE