spark-issues mailing list archives

From "yuhao yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
Date Fri, 09 Jan 2015 10:58:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270869#comment-14270869
] 

yuhao yang commented on SPARK-1405:
-----------------------------------

Great design doc and solid proposal.

I noticed the online variational EM mentioned in the doc, for which I have developed a Spark
implementation. The work was based on an actual customer scenario and has shown strong speed
and low memory usage. The results are as good as those of "batch" LDA, and the online
formulation naturally supports streaming text.

We are now porting it to a graph-based implementation and will evaluate it further afterwards.
The algorithm looks promising to us and can be helpful in many cases. So far I don't see
online LDA making the API design more complicated, since it is more of an incremental addition.
I just want to raise the possibility in case anyone sees a conflict.

Reference: [online LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf]
by [Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]
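For readers unfamiliar with the referenced paper, the core minibatch update of online
variational Bayes for LDA can be sketched as follows. This is a minimal single-machine
illustration under assumed default hyperparameters (alpha, eta, tau0, kappa), using a
hand-rolled digamma approximation; it is not the Spark implementation described above:

```python
import numpy as np

def digamma(x):
    # Asymptotic-series approximation of the digamma function,
    # using the recurrence digamma(x) = digamma(x+1) - 1/x to push x above 6.
    x = np.asarray(x, dtype=float).copy()
    r = np.zeros_like(x)
    while np.any(x < 6):
        r = np.where(x < 6, r - 1.0 / x, r)
        x = np.where(x < 6, x + 1.0, x)
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def e_log_dirichlet(x):
    # E[log p] under a Dirichlet with parameter rows x.
    return digamma(x) - digamma(x.sum(axis=-1, keepdims=True))

def online_lda_step(lam, docs, D, t, alpha=0.1, eta=0.01,
                    tau0=1.0, kappa=0.7, inner_iters=20):
    """One minibatch update of the global topic-word parameter lambda.

    lam  : (K, V) current variational parameter over topic-word weights
    docs : minibatch as a list of (word_ids, counts); word_ids unique per doc
    D    : total corpus size; t : update counter (drives the learning rate)
    """
    K, V = lam.shape
    elog_beta = e_log_dirichlet(lam)                 # (K, V)
    sstats = np.zeros_like(lam)
    for ids, cts in docs:
        cts = np.asarray(cts, dtype=float)
        gamma = np.ones(K)                           # per-doc topic proportions
        exp_elog_beta_d = np.exp(elog_beta[:, ids])  # (K, n_d)
        for _ in range(inner_iters):
            # E-step: iterate phi and gamma to a local optimum for this doc.
            phi = np.exp(e_log_dirichlet(gamma))[:, None] * exp_elog_beta_d
            phi /= phi.sum(axis=0, keepdims=True) + 1e-100
            gamma = alpha + (phi * cts).sum(axis=1)
        sstats[:, ids] += phi * cts
    rho = (tau0 + t) ** (-kappa)                     # decaying step size
    lam_hat = eta + (D / len(docs)) * sstats         # noisy full-corpus estimate
    return (1 - rho) * lam + rho * lam_hat           # stochastic natural-gradient step
```

The decaying step size rho is what makes the stream of minibatches converge; each
minibatch contributes a noisy estimate of the full-corpus sufficient statistics.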


> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a
text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization
algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles
API (solved yet), a word segmentation step (imported from Lucene), and a Gibbs sampling core.
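For context, the collapsed Gibbs sampler that the description refers to can be sketched as
follows. This is a minimal single-machine illustration with assumed hyperparameters alpha
and beta, not the proposed distributed Spark implementation:

```python
import random

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a tiny in-memory corpus.

    docs: list of documents, each a list of word ids in [0, V).
    Returns doc-topic counts and topic-word counts after `iters` sweeps.
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # words assigned to each topic
    z = []                              # z[d][i]: topic of word i in doc d
    for d, doc in enumerate(docs):      # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment, then resample from the
                # collapsed conditional p(z = k | all other assignments).
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                k = 0
                while r > weights[k]:
                    r -= weights[k]; k += 1
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

The count tables are exactly what makes distributing this sampler hard: every resampled
word touches shared state, which is why parallel variants approximate the sequential sweep.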



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

