README

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

# Install the released version from CRAN
install.packages("textrecipes")

# Or install the development version from GitHub
# install.packages("pak")
pak::pak("tidymodels/textrecipes")

Example

The following example goes through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stop words and keeping only the 10 most used words of each variable. The preprocessing is applied to the variables medium and artist.

library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

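okc_rec <- recipe(~ medium + artist, data = tate_text) %>%
  # split each text variable into word tokens
  step_tokenize(medium, artist) %>%
  # drop stop words
  step_stopwords(medium, artist) %>%
  # keep the 10 most frequent tokens per variable
  step_tokenfilter(medium, artist, max_tokens = 10) %>%
  # turn the remaining tokens into TF-IDF columns
  step_tfidf(medium, artist)

# estimate the vocabulary and weights from the data
okc_obj <- okc_rec %>%
  prep()

# apply the prepared recipe and inspect the result
str(bake(okc_obj, tate_text))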
#> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
#>  $ tfidf_medium_colour     : num [1:4284] 2.31 0 0 0 0 ...
#>  $ tfidf_medium_etching    : num [1:4284] 0 0.86 0.86 0.86 0 ...
#>  $ tfidf_medium_gelatin    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_paint      : num [1:4284] 0 0 0 0 2.35 ...
#>  $ tfidf_medium_paper      : num [1:4284] 0 0.422 0.422 0.422 0 ...
#>  $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_print      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_silver     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_akram      : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_beuys      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_ferrari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_john       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_joseph     : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_león       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_richard    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_schütte    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_thomas     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_zaatari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
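
To make the meaning of the tfidf_* columns more concrete, the same four stages (tokenize, remove stop words, keep the most frequent tokens, weight by TF-IDF) can be sketched in base R on a toy vector. This is only an illustration under stated assumptions: the documents, the stop word list, and the textbook formula tf * log(N / df) are chosen here for clarity and are not taken from textrecipes, whose steps may tokenize, normalize, and smooth differently.

```r
# Toy documents standing in for a text column (hypothetical data).
docs <- c(
  "oil paint on canvas",
  "etching on paper",
  "oil paint on paper"
)

# 1. Tokenize: lower-case and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Remove stop words (a tiny hand-picked list, for illustration only).
stop_words <- c("on", "the", "and")
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# 3. Keep only the `max_tokens` most frequent tokens across all documents.
max_tokens <- 3
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(counts)[seq_len(min(max_tokens, length(counts)))]

# 4. Term frequency: share of each document's kept tokens taken by each term.
tf <- t(vapply(tokens, function(x) {
  kept <- x[x %in% vocab]
  as.numeric(table(factor(kept, levels = vocab))) / max(length(kept), 1)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# 5. Inverse document frequency, textbook form: log(N / document frequency).
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# 6. TF-IDF: one numeric column per kept token, one row per document.
tfidf <- sweep(tf, 2, idf, `*`)
colnames(tfidf) <- paste0("tfidf_", vocab)
tfidf
```

As in the output above, each kept token becomes one numeric column, and a document that does not contain the token gets 0 in that column.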

Breaking changes

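As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables. A recipe that passes a character variable directly, such as

recipe(~text_var, data = data) %>%
  step_lda(text_var)

can be replaced with the following recipe, which tokenizes the variable first, to achieve the same results:
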
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)
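
Because this behaviour depends on the installed version, a script that has to run on machines with different textrecipes versions could check the version before building the recipe. The following is a minimal sketch: the helper name needs_tokenize_before_lda() is hypothetical and not part of textrecipes, and it relies only on requireNamespace() and utils::packageVersion() from base R.

```r
# Version from which step_lda() expects tokenlist input (stated above).
min_version <- "0.4.0"

# Hypothetical helper, not part of textrecipes.
# Returns TRUE  when the installed package is at or above `since`,
#         FALSE when it is older,
#         NA    when the package is not installed.
needs_tokenize_before_lda <- function(pkg = "textrecipes", since = min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA)
  }
  utils::packageVersion(pkg) >= since
}

needs_tokenize_before_lda()
```

A TRUE result means step_tokenize() has to come before step_lda(); an NA result means textrecipes still needs to be installed.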

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
