spark-issues mailing list archives

From "Debasish Das (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
Date Sun, 01 Mar 2015 16:52:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342311#comment-14342311
] 

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM:
-------------------------------------------------------------

I am right now using the following PR to do large-rank matrix factorization with various constraints: https://github.com/scalanlp/breeze/pull/364

I am not sure whether the current ALS will scale to large ranks (we want to go to the ~10K range with sparse factors), so I am keen to compare the same formulation in the graphx based LDA flow.

The idea here is to solve the constrained factorization problem as explained by Vorontsov and Potapenko, alternating between two sub-problems:

minimize f(w, h*)
s.t. 1'w = 1, w >= 0 (row constraints)

minimize f(w*, h)
s.t. 0 <= h <= 1

and then normalize each column of h.
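A minimal NumPy sketch of the two constrained updates above (hypothetical helper names; the actual Breeze PR may use a different projection or solver). The row constraint is a Euclidean projection onto the probability simplex; the h constraint is an entrywise clip to [0, 1] followed by column normalization:

```python
import numpy as np

def project_simplex(v):
    """Project v onto {x : sum(x) = 1, x >= 0} via the sort-based projection."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def clip_and_normalize_columns(h):
    """Clip h into [0, 1] entrywise, then scale each column to sum to 1."""
    h = np.clip(h, 0.0, 1.0)
    sums = h.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1.0                      # avoid division by zero
    return h / sums
```

The projection keeps each row of w a valid probability distribution after an unconstrained update, which is what the 1'w = 1, w >= 0 constraint requires.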

Here I want f(w, h) to be a MAP loss. I already solved the least-squares variant in https://issues.apache.org/jira/browse/SPARK-2426
for low ranks and got a good improvement in MAP statistics for recommendation workloads; here I expect perplexity to improve as well.

If no one else is looking into it, I would like to compare the join-based factorization flow
(ml.recommendation.ALS) with the graphx based LDA flow.

In fact, if you think the LDA-based flow will be more efficient than the join-based factorization
flow for large ranks, I can implement stochastic matrix factorization directly on top of the LDA
flow and add both the least-squares and MAP losses.

I am assuming here that the LDA architecture is a bipartite graph with docs/words as nodes and
counts on each edge. The solver will run once on every node of each partition after it collects
the ratings and factors from its edges.
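Under that assumption, a per-node update could look roughly like the following sketch (plain NumPy, hypothetical names; the real implementation would live inside the GraphX message-passing flow, and the projection here is a crude clip-and-rescale rather than an exact simplex projection):

```python
import numpy as np
from collections import defaultdict

def update_doc_rows(edges, w, h, lr=0.1):
    """edges: list of (doc, word, count) collected from each doc node's edges.
    w: docs x rank factor matrix; h: rank x words factor matrix.
    Takes one gradient step per doc on the least-squares loss over its own
    edges, then pushes the row back toward the simplex (clip + rescale)."""
    by_doc = defaultdict(list)
    for d, t, c in edges:
        by_doc[d].append((t, c))
    for d, pairs in by_doc.items():
        grad = np.zeros_like(w[d])
        for t, c in pairs:
            err = w[d] @ h[:, t] - c          # residual on this edge
            grad += err * h[:, t]
        row = np.clip(w[d] - lr * grad, 0.0, None)
        s = row.sum()
        w[d] = row / s if s > 0 else np.full_like(row, 1.0 / row.size)
    return w
```

Each doc node only needs the factors of the word nodes it shares an edge with, which is exactly what a GraphX-style gather over a bipartite doc/word graph provides.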



> Support sparse LDA solutions
> ----------------------------
>
>                 Key: SPARK-5564
>                 URL: https://issues.apache.org/jira/browse/SPARK-5564
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration
> parameters be > 1.0.  It should support values > 0.0, which should encourage sparser
> topics (phi) and document-topic distributions (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko.
> "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix
> Factorization." 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

