spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
Date Thu, 08 Jan 2015 19:40:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269933#comment-14269933 ]

Joseph K. Bradley edited comment on SPARK-1405 at 1/8/15 7:39 PM:
------------------------------------------------------------------

Hi all, there are several possible Spark LDA implementations out there (in PRs or public GitHub
repos), and I believe the best thing to do is to:
* settle on a simple API + implementation to start with
* switch existing PRs that use alternative algorithms (EM, Gibbs sampling, variational EM,
etc.) to use the same interface, where the inference algorithm can be set via a parameter
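
As a rough illustration of the second point (not taken from the design doc or from any PR), a single interface with the inference algorithm chosen by a parameter might look like the sketch below. Every name in it (`LDA`, `set_k`, `set_algorithm`, the `"em"` and `"gibbs"` keys) is hypothetical, and the two algorithm functions are placeholders that return uniform topics rather than performing real inference.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical and illustrate the idea only;
# they are not the MLlib API.

Corpus = List[List[int]]          # each document is a list of term ids
TopicMatrix = List[List[float]]   # k rows, one weight per vocabulary term


def _uniform_topics(k: int, vocab_size: int) -> TopicMatrix:
    """Placeholder result: every topic is uniform over the vocabulary."""
    return [[1.0 / vocab_size] * vocab_size for _ in range(k)]


def run_em(corpus: Corpus, k: int, vocab_size: int) -> TopicMatrix:
    # Stand-in for an EM-based implementation; no real inference is done.
    return _uniform_topics(k, vocab_size)


def run_gibbs(corpus: Corpus, k: int, vocab_size: int) -> TopicMatrix:
    # Stand-in for a Gibbs sampling implementation; no real inference is done.
    return _uniform_topics(k, vocab_size)


# Registry mapping the user-facing parameter value to an implementation.
ALGORITHMS: Dict[str, Callable[[Corpus, int, int], TopicMatrix]] = {
    "em": run_em,
    "gibbs": run_gibbs,
}


@dataclass
class LDA:
    k: int = 10
    algorithm: str = "em"

    def set_k(self, k: int) -> "LDA":
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        return self

    def set_algorithm(self, name: str) -> "LDA":
        # Reject unknown names early so a typo fails before any work starts.
        if name not in ALGORITHMS:
            raise ValueError(
                f"unknown algorithm {name!r}; expected one of {sorted(ALGORITHMS)}"
            )
        self.algorithm = name
        return self

    def run(self, corpus: Corpus) -> TopicMatrix:
        if not any(corpus):
            raise ValueError("corpus contains no terms")
        # Term ids are assumed to be dense and zero-based.
        vocab_size = 1 + max(term for doc in corpus for term in doc)
        return ALGORITHMS[self.algorithm](corpus, self.k, vocab_size)
```

With this shape, a caller would write something like `LDA().set_k(2).set_algorithm("gibbs").run(corpus)`, and an existing PR could plug in by registering its own function under a new key without changing the public methods.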

Towards this goal, I've written [this design doc | https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]
which focuses on the API rather than algorithm design.  I'm also preparing a PR based on the
simplest implementation I have been able to find, written by [~dlwh].  I should be able to
submit it in a day or so.  It uses (non-variational) EM, which should be fast, though perhaps
not as accurate as Gibbs sampling.

I'd of course appreciate feedback on the design doc, as well as the actual PR.  It will be
great to settle on a public API which can satisfy the many existing implementations of LDA
in Spark.

When we merge the initial LDA PR, [~mengxr] will be sure to include all of those who have
participated as authors of Spark LDA PRs: [~akopich], [~witgo], [~yinxusen], [~dlwh], Pedro,
[~jegonzal]




> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from
> a text corpus. Unlike the current machine learning algorithms in MLlib, which use
> optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs
> sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles
> API (solved yet), word segmentation (imported from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)