NUSS (Mixed N-Grams and Unigram Sequence Segmentation) is an R package that segments and simplifies sequences using synthetic n-gram sequences and dynamic unigram sequence matching. It is particularly useful for text processing and natural language processing (NLP) tasks.
You can install the development version of NUSS from GitHub using the following commands:
# Install devtools if you haven't already
install.packages("devtools")
# Install NUSS from GitHub
devtools::install_github("theogrost/NUSS")
Here are some basic examples to get you started with NUSS.
You can segment sequences using the ngrams_segmentation
function. First, create an n-grams dictionary, and then use the
dictionary to segment sequences.
library(NUSS)
# Segment a sequence using built-in dictionary
unigram_sequencer_segmented <- unigram_sequence_segmentation("thisisscience")
print(unigram_sequencer_segmented)
# Example text data
texts <- c(
  "this is science",
  "science is fascinating",
  "this is a scientific approach",
  "science is everywhere",
  "the beauty of science"
)
# Segment a sequence using n-grams
ngrams_dict <- ngrams_dictionary(texts)
ngrams_segmented <- ngrams_segmentation("thisisscience", ngrams_dict)
print(ngrams_segmented)
# Segment a sequence using nuss - combined function
nuss_segmented <- nuss("thisisscience", texts)
print(nuss_segmented)
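To give a feel for what dictionary-based segmentation involves, here is a minimal base R sketch that splits an unspaced string into known words with dynamic programming. It is not NUSS's algorithm or API: `segment_sketch` and `word_counts` are hypothetical names introduced for this illustration, and scoring a split by summed word frequencies is an assumption made only to keep the example small.

```r
# Illustration only: a generic word-break sketch, not the NUSS implementation.
texts <- c(
  "this is science",
  "science is fascinating",
  "this is a scientific approach",
  "science is everywhere",
  "the beauty of science"
)

# Build a unigram frequency table from the example texts.
words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
words <- words[nzchar(words)]
tab <- table(words)
word_counts <- setNames(as.integer(tab), names(tab))

# Split `x` into dictionary words, maximising the summed word counts
# (hypothetical scoring rule chosen for this sketch).
segment_sketch <- function(x, counts) {
  n <- nchar(x)
  best <- c(0, rep(-Inf, n))        # best[i + 1]: best score for the first i characters
  prev <- rep(NA_integer_, n + 1)   # prev[i + 1]: offset where the last word starts
  for (end in seq_len(n)) {
    for (start in 0:(end - 1)) {
      piece <- substr(x, start + 1, end)
      if (is.finite(best[start + 1]) && piece %in% names(counts)) {
        score <- best[start + 1] + counts[[piece]]
        if (score > best[end + 1]) {
          best[end + 1] <- score
          prev[end + 1] <- start
        }
      }
    }
  }
  # No way to cover the whole string with dictionary words.
  if (!is.finite(best[n + 1])) return(NA_character_)
  # Walk the back-pointers from the end to recover the words.
  out <- character(0)
  end <- n
  while (end > 0) {
    start <- prev[end + 1]
    out <- c(substr(x, start + 1, end), out)
    end <- start
  }
  paste(out, collapse = " ")
}

segment_sketch("thisisscience", word_counts)
# [1] "this is science"
```

A string that cannot be covered by dictionary words returns `NA` in this sketch; how NUSS handles such input is not described here.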
You can customize the segmentation process with additional parameters
such as simplify, omit_zero, and score_formula.
# Custom segmentation with additional parameters
ngrams_dict <- ngrams_dictionary(texts, clean = TRUE, ngram_min = 1, ngram_max = 5, points_filter = 1)
custom_segmented <- ngrams_segmentation(
  "thisisscience",
  ngrams_dict,
  simplify = TRUE,
  omit_zero = TRUE,
  score_formula = "points / words.number ^ 2"
)
print(custom_segmented)
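The formula string above can be read as ordinary R arithmetic. The sketch below evaluates it in base R against a small table of made-up candidates, assuming that `points` and `words.number` are per-candidate values; how NUSS actually computes and applies them is not documented here, so the data frame and its numbers are purely hypothetical.

```r
# Illustration only: hypothetical candidates with invented point totals.
candidates <- data.frame(
  segmentation = c("this is science", "this is sci en ce"),
  points = c(9, 9),
  words.number = c(3, 5)
)

score_formula <- "points / words.number ^ 2"

# Evaluate the formula string with the data frame columns as variables.
# In R, ^ binds tighter than /, so this computes points / (words.number^2).
candidates$score <- eval(parse(text = score_formula), envir = candidates)

# Highest score first.
candidates[order(-candidates$score), ]
```

With equal points, the candidate made of fewer words scores higher (1 versus 0.36), which is the kind of penalty on over-splitting that a formula of this shape expresses.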
NUSS is licensed under the GPL-3.0 License.
For any questions or issues, please open an issue on GitHub or contact the maintainer.
I hope you find NUSS useful for your text processing tasks!