Tokenize With Regex

Splits text in a column into tokens using a regular expression.

This operation is ported from Spark ML.

For a comprehensive introduction, see the Spark documentation.

For Scala API details, see the org.apache.spark.ml.feature.RegexTokenizer documentation.

Since: Seahorse 1.0.0

Input

| Port | Type Qualifier | Description |
| ---- | -------------- | ----------- |
| 0 | DataFrame | The input DataFrame. |

Output

| Port | Type Qualifier | Description |
| ---- | -------------- | ----------- |
| 0 | DataFrame | The output DataFrame. |
| 1 | Transformer | A Transformer that allows the operation to be applied to other DataFrames using a Transform. |

Parameters

| Name | Type | Description |
| ---- | ---- | ----------- |
| gaps | Boolean | Indicates whether the regex splits on gaps (true) or matches tokens (false); see the sketch after this table. |
| min token length | Numeric | The minimum token length. |
| pattern | String | The regex pattern used to match delimiters (gaps = true) or tokens (gaps = false). |
| operate on | InputOutputColumnSelector | The input and output columns for the operation. |
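
The gaps parameter decides whether pattern describes the separators between tokens (gaps = true) or the tokens themselves (gaps = false). A minimal sketch of the two modes in plain Scala, using only the standard library rather than Seahorse or Spark:

```scala
// gaps = true: the pattern matches the delimiters;
// the text between matches becomes the tokens.
val byGaps = "Hi I heard about Spark".split("\\s+")
// Array("Hi", "I", "heard", "about", "Spark")

// gaps = false: the pattern matches the tokens themselves.
val byTokens = "\\w+".r.findAllIn("Logistic,regression,models").toList
// List("Logistic", "regression", "models")
```

After splitting, tokens shorter than min token length are dropped.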

Example

Parameters

| Name | Value |
| ---- | ----- |
| gaps | true |
| min token length | 1.0 |
| pattern | "\\s+" |
| operate on | one column |
| input column | "sentence" |
| output | append new column |
| output column | "tokenized" |

Input

| label | sentence |
| ----- | -------- |
| 0 | Hi I heard about Spark |
| 1 | I wish Java could use case classes |
| 2 | Logistic,regression,models,are,neat |

Output

| label | sentence | tokenized |
| ----- | -------- | --------- |
| 0 | Hi I heard about Spark | [hi,i,heard,about,spark] |
| 1 | I wish Java could use case classes | [i,wish,java,could,use,case,classes] |
| 2 | Logistic,regression,models,are,neat | [logistic,regression,models,are,neat] |
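
The same output can be reproduced with the underlying Spark transformer directly. The following is an illustrative sketch, not Seahorse code; it assumes a local SparkSession and mirrors the parameter values from the example above:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("regex-tokenizer-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The example's input DataFrame.
val df = Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
).toDF("label", "sentence")

// Mirrors the parameters above: gaps = true, min token length = 1,
// pattern = "\\s+", input column "sentence", appended output column "tokenized".
val tokenizer = new RegexTokenizer()
  .setGaps(true)
  .setMinTokenLength(1)
  .setPattern("\\s+")
  .setInputCol("sentence")
  .setOutputCol("tokenized")

tokenizer.transform(df).show(truncate = false)
```

Two details of the underlying transformer explain the output above: Spark's RegexTokenizer lowercases tokens by default (its toLowercase parameter), and the third sentence contains no whitespace, so with gaps = true it yields a single token. As with the Transformer on output port 1, the same tokenizer can be applied to other DataFrames by calling transform again.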