Word2Vec

Transforms sequences of words into fixed-length numeric vectors (codes) for further processing by NLP or machine learning algorithms.

This operation is ported from Spark ML.

For a comprehensive introduction, see the Spark documentation.

For Scala API details, see the org.apache.spark.ml.feature.Word2Vec documentation.

Since: Seahorse 1.0.0

Input

Port Type Qualifier Description
0 DataFrame The input DataFrame.

Output

Port Type Qualifier Description
0 DataFrame The output DataFrame.
1 Transformer A Transformer that can apply the operation to other DataFrames using a Transform.

Parameters

Name Type Description
input column SingleColumnSelector The input column name.
output SingleChoice Output generation mode. Possible values: ["replace input column", "append new column"]
max iterations Numeric The maximum number of iterations.
step size Numeric The step size to be used for each iteration of optimization.
seed Numeric The random seed.
vector size Numeric The dimension of the code vectors produced from words.
num partitions Numeric The number of partitions for sentences of words.
min count Numeric The minimum number of occurrences a token must have to be included in the model's vocabulary.
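
To make the effect of min count concrete, here is a minimal sketch in plain Python (not Seahorse or Spark code) that counts tokens across all rows and keeps those that occur at least min count times. The variable names are illustrative, and the rows are the ones used in the Example section of this page.

```python
from collections import Counter

# Rows of the "words" column from the Example section.
rows = [
    ["Lorem", "ipsum", "at", "dolor"],
    ["Nullam", "gravida", "non", "ipsum"],
    ["Etiam", "at", "nunc", "lacinia"],
]
min_count = 2  # same value as in the Example section

# Count every token across all rows.
counts = Counter(word for row in rows for word in row)

# Keep only tokens that occur at least min_count times.
vocabulary = sorted(word for word, n in counts.items() if n >= min_count)

print(vocabulary)  # ['at', 'ipsum']
```

With min count set to 2, only "at" and "ipsum" enter the vocabulary; every other token in the example occurs once and is ignored by the model.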

Example

Parameters

Name Value
input column "words"
output append new column
output column "vectors"
max iterations 1.0
step size 0.025
seed 0.0
vector size 5.0
num partitions 1.0
min count 2.0

Input

words
[Lorem,ipsum,at,dolor]
[Nullam,gravida,non,ipsum]
[Etiam,at,nunc,lacinia]

Output

words vectors
[Lorem,ipsum,at,dolor] [-0.005178218358196318,0.006232232786715031,-3.91125213354826E-4,0.018661257810890675,-0.023597532883286476]
[Nullam,gravida,non,ipsum] [8.919694228097796E-4,0.002301964908838272,-0.006360208615660667,0.023417502641677856,-0.016035044565796852]
[Etiam,at,nunc,lacinia] [-0.006070187781006098,0.003930267877876759,0.0059690834023058414,-0.004756244830787182,-0.007562488317489624]
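
The output vectors above are consistent with each row being the average of its word vectors, which is how the Spark documentation describes the Word2Vec model's transformation. The sketch below checks this in plain Python. It assumes that tokens outside the vocabulary contribute a zero vector and that the sum is divided by the full row length (4); both assumptions are inferred from the numbers in the table rather than stated on this page. Under min count 2, only "ipsum" and "at" are in the vocabulary, so the first row should equal the sum of the second and third rows.

```python
import math

# Output vectors copied from the table above.
row1 = [-0.005178218358196318, 0.006232232786715031, -3.91125213354826E-4,
        0.018661257810890675, -0.023597532883286476]  # [Lorem, ipsum, at, dolor]
row2 = [8.919694228097796E-4, 0.002301964908838272, -0.006360208615660667,
        0.023417502641677856, -0.016035044565796852]  # [Nullam, gravida, non, ipsum]
row3 = [-0.006070187781006098, 0.003930267877876759, 0.0059690834023058414,
        -0.004756244830787182, -0.007562488317489624]  # [Etiam, at, nunc, lacinia]

ROW_LENGTH = 4  # every example row holds four tokens

# Assumption: rows 2 and 3 each contain exactly one in-vocabulary token,
# so the per-word vector is the row vector scaled back by the row length.
ipsum = [ROW_LENGTH * x for x in row2]
at = [ROW_LENGTH * x for x in row3]


def average(word_vectors, row_length):
    """Sum the given word vectors component-wise and divide by the row length."""
    return [sum(components) / row_length for components in zip(*word_vectors)]


# Row 1 contains both "ipsum" and "at"; its other tokens are out of vocabulary.
reconstructed = average([ipsum, at], ROW_LENGTH)

matches = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(reconstructed, row1)
)
print(matches)  # True
```

Note that with max iterations set to 1 and such a small input, the vectors themselves carry little semantic meaning; the example mainly illustrates the shape of the output.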