Count Vectorizer

Extracts the vocabulary from a given collection of documents and generates a vector of token counts for each document.

This operation is ported from Spark ML.

For a comprehensive introduction, see the Spark documentation.

For Scaladoc details, see the org.apache.spark.ml.feature.CountVectorizer documentation.

Since: Seahorse 1.0.0

Input

| Port | Type Qualifier | Description |
| --- | --- | --- |
| 0 | DataFrame | The input DataFrame. |

Output

| Port | Type Qualifier | Description |
| --- | --- | --- |
| 0 | DataFrame | The output DataFrame. |
| 1 | Transformer | A Transformer that applies this operation to other DataFrames using a Transform. |

Parameters

| Name | Type | Description |
| --- | --- | --- |
| input column | SingleColumnSelector | The input column name. |
| output | SingleChoice | Output generation mode. Possible values: ["replace input column", "append new column"] |
| max vocabulary size | Numeric | The maximum size of the vocabulary. |
| min different documents | Numeric | The minimum number of different documents a term must appear in to be included in the vocabulary. |
| min term frequency | Numeric | A filter that ignores rare terms within a document. For each document, terms with a count below the given threshold are ignored. A value >= 1 is an absolute count (the number of times the term must appear in the document); a value in [0,1) is a fraction of the document's token count. This parameter is used only when the CountVectorizer model transforms data and does not affect fitting. |
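
The two readings of min term frequency (absolute count versus fraction of the document's token count) can be sketched in a few lines of plain Python. This is only an illustration of the rule described above, not the Spark or Seahorse implementation; the function names `effective_min_count` and `filter_terms` are hypothetical.

```python
from collections import Counter


def effective_min_count(min_tf, doc_length):
    """Return the smallest count a term needs in one document to be kept."""
    # Values >= 1 are treated as absolute counts; values in [0, 1) are
    # treated as a fraction of the document's total token count.
    return min_tf if min_tf >= 1.0 else min_tf * doc_length


def filter_terms(tokens, min_tf):
    """Count the terms of one document and drop those below the threshold."""
    counts = Counter(tokens)
    threshold = effective_min_count(min_tf, len(tokens))
    return {term: n for term, n in counts.items() if n >= threshold}


# Absolute threshold: a term must occur at least 3 times in the document.
by_count = filter_terms(list("aaabbcccd"), 3.0)

# Fractional threshold: a term must make up at least half of the 4 tokens.
by_fraction = filter_terms(["a", "a", "b", "c"], 0.5)
```

With the first call only `a` and `c` survive, since `b` and `d` occur fewer than three times. The filter runs per document, so it changes the output vectors but not the fitted vocabulary.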

Example

Parameters

| Name | Value |
| --- | --- |
| input column | "lines" |
| output | append new column |
| output column | "lines_out" |
| max vocabulary size | 262144.0 |
| min different documents | 1.0 |
| min term frequency | 3.0 |

Input

| lines |
| --- |
| [a,a,a,b,b,c,c,c,d] |
| [c,c,c,c,c,c] |
| [a] |
| [e,e,e,e,e] |

Output

| lines | lines_out |
| --- | --- |
| [a,a,a,b,b,c,c,c,d] | (5,[0,2],[3.0,3.0]) |
| [c,c,c,c,c,c] | (5,[0],[6.0]) |
| [a] | (5,[],[]) |
| [e,e,e,e,e] | (5,[1],[5.0]) |
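
Each output value is a sparse vector written as (size, indices, values): the vocabulary has 5 terms, and only the listed indices carry non-zero counts. The following plain-Python sketch reproduces the example under stated assumptions: the vocabulary is ordered by total term count across all documents, most frequent first, and ties are broken alphabetically (the tie-break is an assumption made only for this sketch). The helper names `fit_vocabulary` and `transform` are hypothetical and do not belong to Spark or Seahorse.

```python
from collections import Counter


def fit_vocabulary(docs, max_vocab_size, min_df):
    """Build the vocabulary: terms ordered by total count, most frequent first."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_counts = Counter()   # number of different documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    # Assumption for this sketch: ties are broken alphabetically.
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:int(max_vocab_size)]


def transform(tokens, vocabulary, min_tf):
    """Return one document as a (size, indices, values) sparse vector."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(t for t in tokens if t in index)
    # min_tf >= 1 is an absolute count; otherwise a fraction of the token count.
    threshold = min_tf if min_tf >= 1.0 else min_tf * len(tokens)
    kept = sorted((index[t], float(n)) for t, n in counts.items() if n >= threshold)
    return (len(vocabulary), [i for i, _ in kept], [v for _, v in kept])


docs = [list("aaabbcccd"), list("cccccc"), ["a"], list("eeeee")]
vocab = fit_vocabulary(docs, max_vocab_size=262144, min_df=1)
vectors = [transform(d, vocab, min_tf=3.0) for d in docs]
for vector in vectors:
    print(vector)
```

Here the corpus-wide counts are c=9, e=5, a=4, b=2, d=1, so the vocabulary is [c, e, a, b, d] and `c` receives index 0. In the first document only `a` (index 2) and `c` (index 0) reach the threshold of 3, which gives (5,[0,2],[3.0,3.0]). The third document has a single `a`, so its vector is empty.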