Word2Vec
Transforms vectors of words into vectors of numeric codes for the purpose of further
processing by NLP or machine learning algorithms.
This operation is ported from Spark ML.
For a comprehensive introduction, see the Spark documentation.
For Scala API details, see the org.apache.spark.ml.feature.Word2Vec documentation.
Since: Seahorse 1.0.0
Input
Port | Type Qualifier | Description |
0 | DataFrame | The input DataFrame. |
Output
Port | Type Qualifier | Description |
0 | DataFrame | The output DataFrame. |
1 | Transformer | A Transformer that allows the operation to be applied to other DataFrames using a Transform. |
Parameters
Name | Type | Description |
input column | SingleColumnSelector | The input column name. |
output | SingleChoice | Output generation mode. Possible values: ["replace input column", "append new column"] |
max iterations | Numeric | The maximum number of iterations. |
step size | Numeric | The step size to be used for each iteration of optimization. |
seed | Numeric | The random seed. |
vector size | Numeric | The dimension of the numeric code vector produced for each word. |
num partitions | Numeric | The number of partitions for sentences of words. |
min count | Numeric | The minimum number of occurrences of a token to be included in the model's vocabulary. |
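To illustrate the min count parameter, here is a minimal plain-Python sketch of the filtering idea: count every token across all sentences and keep only those that reach the threshold. This is an approximation for illustration, not the Spark ML implementation. The function name `build_vocabulary` is hypothetical, and the sentences are the ones used in the Example section of this page.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Return {token: count} for tokens that occur at least min_count times.

    Hypothetical helper for illustration only; it mirrors the documented
    meaning of "min count", not Spark ML's internal code.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


# Sentences taken from the Example section of this page.
sentences = [
    ["Lorem", "ipsum", "at", "dolor"],
    ["Nullam", "gravida", "non", "ipsum"],
    ["Etiam", "at", "nunc", "lacinia"],
]

vocabulary = build_vocabulary(sentences, min_count=2)
print(sorted(vocabulary))  # ['at', 'ipsum']
```

With min count set to 2, only "ipsum" and "at" occur often enough in this input to be kept; every other token appears once.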
Example
Parameters
Name | Value |
input column | "words" |
output | append new column |
output column | "vectors" |
max iterations | 1.0 |
step size | 0.025 |
seed | 0.0 |
vector size | 5.0 |
num partitions | 1.0 |
min count | 2.0 |
Input
words |
[Lorem,ipsum,at,dolor] |
[Nullam,gravida,non,ipsum] |
[Etiam,at,nunc,lacinia] |
Output
words | vectors |
[Lorem,ipsum,at,dolor] | [-0.005178218358196318,0.006232232786715031,-3.91125213354826E-4,0.018661257810890675,-0.023597532883286476] |
[Nullam,gravida,non,ipsum] | [8.919694228097796E-4,0.002301964908838272,-0.006360208615660667,0.023417502641677856,-0.016035044565796852] |
[Etiam,at,nunc,lacinia] | [-0.006070187781006098,0.003930267877876759,0.0059690834023058414,-0.004756244830787182,-0.007562488317489624] |
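The example output can be sanity-checked against the way Spark ML describes the Word2VecModel transform, which averages word vectors over each document. The sketch below makes two assumptions that this page does not state: tokens below min count contribute nothing to a row, and each row's sum is divided by its token count. Under those assumptions, and because all three rows have four tokens, the row containing both "ipsum" and "at" should equal the sum of the two rows that contain only one of them. The vectors are copied from the output table; the variable and helper names are illustrative.

```python
import math

# Output vectors copied from the example table on this page.
lorem_ipsum_at_dolor = [
    -0.005178218358196318, 0.006232232786715031, -3.91125213354826E-4,
    0.018661257810890675, -0.023597532883286476,
]
nullam_gravida_non_ipsum = [
    8.919694228097796E-4, 0.002301964908838272, -0.006360208615660667,
    0.023417502641677856, -0.016035044565796852,
]
etiam_at_nunc_lacinia = [
    -0.006070187781006098, 0.003930267877876759, 0.0059690834023058414,
    -0.004756244830787182, -0.007562488317489624,
]


def add_vectors(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


# Assumption: the second row carries only "ipsum" and the third only "at",
# because those are the only tokens that occur at least min count = 2 times.
combined = add_vectors(nullam_gravida_non_ipsum, etiam_at_nunc_lacinia)

matches = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(combined, lorem_ipsum_at_dolor)
)
print(matches)  # True
```

The check passing is consistent with the averaging interpretation, but it does not prove how the operation is implemented.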