Latent Dirichlet Allocation (LDA) is a topic model designed for text documents. LDA takes a collection of documents as input data, via the features column parameter. Each document is represented as a vector whose length equals the vocabulary size, where each entry is the count of the corresponding term (word) in the document. Feature transformers such as Tokenize and Count Vectorizer can be useful for converting text to word count vectors.
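The expected input format can be illustrated with a minimal sketch in plain Python (not the Seahorse or Spark API; the documents and helper names below are illustrative assumptions). It shows how tokenized text is turned into word count vectors of vocabulary length, the shape LDA expects in the features column:

```python
# Plain-Python sketch of what Tokenize + Count Vectorizer produce:
# a count vector per document, with one entry per vocabulary term.
# Documents and function names here are hypothetical, for illustration only.

def build_vocabulary(docs):
    """Map each distinct term across all tokenized documents to an index."""
    vocab = sorted({term for doc in docs for term in doc})
    return {term: idx for idx, term in enumerate(vocab)}

def count_vector(tokens, vocab):
    """Vector of length len(vocab); entry i counts occurrences of term i."""
    vec = [0] * len(vocab)
    for term in tokens:
        if term in vocab:
            vec[vocab[term]] += 1
    return vec

docs = [
    "spark ml topic model".split(),       # tokenized document 1
    "topic model for text text".split(),  # tokenized document 2
]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
# Every vector has vocabulary length; repeated terms raise the count.
```

Each resulting vector has the same length (the vocabulary size), so documents of different lengths become comparable inputs for the model.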

This operation is ported from Spark ML.

For a comprehensive introduction, see the Spark documentation.

For Scala API details, see the org.apache.spark.ml.clustering.LDA documentation.

Since: Seahorse 1.1.0


This operation does not take any input.


| Port | Type Qualifier | Description |
| ---- | -------------- | ----------- |
| 0 | Estimator | An Estimator that can be used in a Fit operation. |


| Name | Type | Description |
| ---- | ---- | ----------- |
| checkpoint interval | Numeric | The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations. |
| k | Numeric | The number of clusters to create. |
| max iterations | Numeric | The maximum number of iterations. |
| optimizer | SingleChoice | The optimizer or inference algorithm used to estimate the LDA model. Currently supported: Online Variational Bayes, Expectation-Maximization. Possible values: ["online", "em"] |
| subsampling rate | Numeric | Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent. Note that this should be adjusted in synchronization with `max iterations` so that the entire corpus is used. Specifically, set both so that `max iterations` * `subsampling rate` >= 1. |
| topic distribution column | String | Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document. |
| features column | SingleColumnSelector | The features column for model fitting. |
| seed | Numeric | The random seed. |
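The guidance on `subsampling rate` can be checked with simple arithmetic. This plain-Python sketch (not the Seahorse API; the parameter values are illustrative assumptions) verifies that a chosen `max iterations` and `subsampling rate` together cover the whole corpus:

```python
# Sanity check for the rule: max_iterations * subsampling_rate >= 1
# ensures that, in expectation, the entire corpus is sampled over training.
# Parameter values below are hypothetical examples.

def covers_corpus(max_iterations, subsampling_rate):
    """True when the expected fraction of the corpus seen across all
    mini-batch iterations is at least 1 (i.e. the whole corpus)."""
    return max_iterations * subsampling_rate >= 1

print(covers_corpus(25, 0.05))  # True: the corpus is covered 1.25 times over
print(covers_corpus(10, 0.05))  # False: only half the corpus is expected to be seen
```

If the check fails, either raise `max iterations` or raise `subsampling rate` until their product reaches at least 1.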