Count Vectorizer
Extracts the vocabulary from a given collection of documents and generates a vector
of token counts for each document.
This operation is ported from Spark ML.
For a comprehensive introduction, see the Spark documentation.
For Scala API details, see the org.apache.spark.ml.feature.CountVectorizer documentation.
Since: Seahorse 1.0.0
Input
Port | Type Qualifier | Description |
0 | DataFrame | The input DataFrame. |
Output
Port | Type Qualifier | Description |
0 | DataFrame | The output DataFrame. |
1 | Transformer | A Transformer that applies the operation to other DataFrames using a Transform. |
Parameters
Name | Type | Description |
input column | SingleColumnSelector | The input column name. |
output | SingleChoice | Output generation mode. Possible values: ["replace input column", "append new column"] |
max vocabulary size | Numeric | The maximum size of the vocabulary. |
min different documents | Numeric | Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. |
min term frequency | Numeric | A filter to ignore rare words in a document. For each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies a count (the number of times the term must appear in the document); if this is a double in [0,1), it specifies a fraction (of the document's token count). Note that this parameter is used only in the transform step of the CountVectorizer model and does not affect fitting. |
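The two interpretations of min term frequency can be illustrated with a minimal pure-Python sketch. It does not use Seahorse or Spark; the helper names `effective_min_tf` and `filter_counts` are hypothetical and only mirror the rule described above, assuming a term is kept when its count is greater than or equal to the resolved threshold.

```python
from collections import Counter


def effective_min_tf(min_tf, token_count):
    """Resolve the threshold (hypothetical helper, not a Seahorse or Spark API).

    A value >= 1 is treated as an absolute count; a value in [0, 1) is
    treated as a fraction of the document's token count.
    """
    if min_tf >= 1.0:
        return min_tf
    return min_tf * token_count


def filter_counts(tokens, min_tf):
    """Count tokens in one document and drop terms below the threshold."""
    counts = Counter(tokens)
    threshold = effective_min_tf(min_tf, len(tokens))
    return {term: count for term, count in counts.items() if count >= threshold}


doc = ["a", "a", "a", "b", "b", "c", "c", "c", "d"]

# Absolute count: a term must appear at least 3 times in the document.
by_count = filter_counts(doc, 3.0)

# Fraction: a term must account for at least 20% of the 9 tokens (1.8 occurrences).
by_fraction = filter_counts(doc, 0.2)

print(by_count)
print(by_fraction)
```

With the absolute threshold only `a` and `c` survive; with the fractional threshold `b` survives as well, because two occurrences exceed 20% of nine tokens.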
Example
Parameters
Name | Value |
input column | "lines" |
output | append new column |
output column | "lines_out" |
max vocabulary size | 262144.0 |
min different documents | 1.0 |
min term frequency | 3.0 |
Input
lines |
[a,a,a,b,b,c,c,c,d] |
[c,c,c,c,c,c] |
[a] |
[e,e,e,e,e] |
Output
lines | lines_out |
[a,a,a,b,b,c,c,c,d] | (5,[0,2],[3.0,3.0]) |
[c,c,c,c,c,c] | (5,[0],[6.0]) |
[a] | (5,[],[]) |
[e,e,e,e,e] | (5,[1],[5.0]) |
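The output above can be reproduced with a short pure-Python sketch, which may help when reading the sparse vector notation (size, indices, values). This is an illustration only, not the Seahorse or Spark implementation. The functions `fit_vocabulary` and `transform` are hypothetical, and the sketch assumes that vocabulary indices are assigned in descending order of total term count across all documents, which is consistent with the indices shown in the example.

```python
from collections import Counter


def fit_vocabulary(docs, max_vocab_size, min_df):
    """Build the vocabulary (hypothetical sketch, not the Spark implementation)."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_counts = Counter()   # number of different documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    # Assumption: most frequent terms get the lowest indices;
    # ties are broken alphabetically to keep the sketch deterministic.
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:int(max_vocab_size)]


def transform(tokens, vocabulary, min_tf):
    """Return a sparse vector as (size, indices, values) for one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(t for t in tokens if t in index)
    # min_tf >= 1 is a count; a value in [0, 1) is a fraction of the token count.
    threshold = min_tf if min_tf >= 1.0 else min_tf * len(tokens)
    kept = sorted((index[t], float(c)) for t, c in counts.items() if c >= threshold)
    return (len(vocabulary), [i for i, _ in kept], [v for _, v in kept])


lines = [
    list("aaabbcccd"),
    list("cccccc"),
    list("a"),
    list("eeeee"),
]

vocabulary = fit_vocabulary(lines, max_vocab_size=262144.0, min_df=1.0)
lines_out = [transform(tokens, vocabulary, min_tf=3.0) for tokens in lines]

print(vocabulary)
for row in lines_out:
    print(row)
```

Under these assumptions the vocabulary is ordered `c, e, a, b, d`, so index 0 refers to `c`, index 1 to `e`, and index 2 to `a`. The third document yields an empty vector because its single `a` falls below the min term frequency of 3.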