spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
Date Fri, 09 Jan 2015 18:53:36 GMT


Joseph K. Bradley commented on SPARK-1405:

It's great to hear that the online variational approach has worked well for you so far.  As for API
design, I agree that the changes to the model API would be small, if any.  I'm less sure
about the Estimator (algorithm) API, but it could probably follow the existing streaming ML algorithms.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>                 Key: SPARK-1405
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a
> text corpus. Unlike the current machine learning algorithms in MLlib, which use
> optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs
> sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles
> API (solved yet), word segmentation (imported from Lucene), and a Gibbs sampling core.
