Tokenize With Regex
Splits text using a regular expression.
This operation is ported from Spark ML. For a comprehensive introduction, see the Spark documentation.
For Scala API details, see the org.apache.spark.ml.feature.RegexTokenizer documentation.
Since: Seahorse 1.0.0
Input

| Port | Type Qualifier | Description |
| --- | --- | --- |
| 0 | DataFrame | The input DataFrame. |
Output

| Port | Type Qualifier | Description |
| --- | --- | --- |
| 0 | DataFrame | The output DataFrame. |
| 1 | Transformer | A Transformer that allows this operation to be applied to other DataFrames using a Transform. |
Parameters

| Name | Type | Description |
| --- | --- | --- |
| gaps | Boolean | Indicates whether the regex splits on gaps (true) or matches tokens (false); see the sketch after this table. |
| min token length | Numeric | The minimum token length. |
| pattern | String | The regex pattern used to match delimiters (gaps = true) or tokens (gaps = false). |
| operate on | InputOutputColumnSelector | The input and output columns for the operation. |
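
Because this operation is ported from Spark ML's RegexTokenizer, the two gaps modes can be illustrated directly against that API. The snippet below is a minimal sketch, assuming a local SparkSession; the column names text and tokens are illustrative and not part of this operation.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("GapsVsTokens")
  .master("local[*]")   // assumption: a local session just for this sketch
  .getOrCreate()
import spark.implicits._

val df = Seq("Logistic,regression,models,are,neat").toDF("text")

// gaps = true: the pattern describes the delimiters between tokens.
val splitOnGaps = new RegexTokenizer()
  .setInputCol("text").setOutputCol("tokens")
  .setGaps(true)
  .setPattern("\\W+")   // split on runs of non-word characters

// gaps = false: the pattern describes the tokens themselves.
val matchTokens = new RegexTokenizer()
  .setInputCol("text").setOutputCol("tokens")
  .setGaps(false)
  .setPattern("\\w+")   // keep runs of word characters

// Both configurations yield [logistic, regression, models, are, neat]
// for this input (RegexTokenizer lower-cases tokens by default).
splitOnGaps.transform(df).show(truncate = false)
matchTokens.transform(df).show(truncate = false)
```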
Example

Parameters

| Name | Value |
| --- | --- |
| gaps | true |
| min token length | 1.0 |
| pattern | "\\s+" |
| operate on | one column |
| input column | "sentence" |
| output | append new column |
| output column | "tokenized" |
Input

| label | sentence |
| --- | --- |
| 0 | Hi I heard about Spark |
| 1 | I wish Java could use case classes |
| 2 | Logistic,regression,models,are,neat |
Output

| label | sentence | tokenized |
| --- | --- | --- |
| 0 | Hi I heard about Spark | [hi,i,heard,about,spark] |
| 1 | I wish Java could use case classes | [i,wish,java,could,use,case,classes] |
| 2 | Logistic,regression,models,are,neat | [logistic,regression,models,are,neat] |
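
For reference, this example can be reproduced against the underlying Spark ML transformer with the same parameter values. This is a sketch assuming a running local SparkSession, not Seahorse code.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TokenizeWithRegexExample")
  .master("local[*]")   // assumption: a local session just for this sketch
  .getOrCreate()

val input = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("label", "sentence")

// The same parameter values as in the example above.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokenized")
  .setGaps(true)
  .setPattern("\\s+")      // gaps = true, so the pattern matches delimiters
  .setMinTokenLength(1)

tokenizer.transform(input).show(truncate = false)
```

Note that the third sentence contains no whitespace, so with gaps set to true and the "\\s+" pattern it is emitted as a single lower-cased token rather than being split on the commas.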