Tokenizer
Text tokenization utilities for breaking text into words, sentences, and n-grams
text := "Hello, world! This is a test.";
words := System.NLP.Tokenizer->WordTokenize(text);
sentences := System.NLP.Tokenizer->SentenceTokenize(text);
bigrams := System.NLP.Tokenizer->WordNGrams(words, 2);
Operations
CharNGrams
Generates character-level n-grams from text
function : CharNGrams(text:String, n:Int) ~ String[]
Parameters
| Name | Type | Description |
|---|---|---|
| text | String | input text to split into character n-grams |
| n | Int | number of characters in each n-gram |
Return
| Type | Description |
|---|---|
| String[] | array of n-gram strings |
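To make the idea of character-level n-grams concrete, here is a minimal language-neutral sketch in Python. It is not the library's implementation: the helper name `char_ngrams` is hypothetical, and details such as how `CharNGrams` treats whitespace, punctuation, or an `n` larger than the input are assumptions this page does not specify.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, scanning left to right.

    Assumption: a non-positive n, or an n longer than the text, yields an
    empty list. The documented CharNGrams may behave differently.
    """
    if n <= 0:
        return []
    # A window of width n starts at each index from 0 to len(text) - n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("token", 3))  # ['tok', 'oke', 'ken']
```

A string of length `m` produces `m - n + 1` n-grams under this sliding-window definition, so `"token"` with `n = 3` gives three trigrams.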
SentenceTokenize
Tokenizes text into sentences using common sentence delimiters
function : SentenceTokenize(text:String) ~ String[]
Parameters
| Name | Type | Description |
|---|---|---|
| text | String | input text to tokenize |
Return
| Type | Description |
|---|---|
| String[] | array of sentence tokens |
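The page does not list which delimiters `SentenceTokenize` recognizes. As a rough illustration of delimiter-based sentence splitting, the Python sketch below assumes the delimiters are `.`, `!`, and `?` followed by whitespace. The helper name `sentence_tokenize` is hypothetical, and the real method may handle abbreviations, quotes, or other punctuation differently.

```python
import re


def sentence_tokenize(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace.

    Assumption: the delimiter stays attached to its sentence. This is a
    simplification and will mis-split abbreviations such as "Dr. Smith".
    """
    # The lookbehind keeps the punctuation; only the whitespace is consumed.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(sentence_tokenize("Hello, world! This is a test."))
# ['Hello, world!', 'This is a test.']
```

The input here is the same string used in the example at the top of this page, so the two sentences can be compared against what `SentenceTokenize` returns in practice.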