[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342311#comment-14342311 ]

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM:
-------------------------------------------------------------

I am currently using the following PR to do large-rank matrix factorization with various constraints: https://github.com/scalanlp/breeze/pull/364

I am not sure whether the current ALS will scale to large ranks (we want to reach roughly the 10K range, sparse), so I am keen to compare the exact formulation against the GraphX-based LDA flow.

The idea is to solve the constrained factorization problem as explained in Vorontsov and Potapenko:

minimize f(w,h*) s.t. 1'w = 1, w >= 0 (row constraints)
minimize f(w*,h) s.t. 0 <= h <= 1, normalize each column in h

Here I want f(w,h) to be the MAP loss. I have already solved the least-squares variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got a good improvement in MAP statistics for recommendation workloads. I expect perplexity to improve here as well.

If no one else is looking into it, I would like to compare the join-based factorization flow (ml.recommendation.ALS) with the GraphX-based LDA flow. In fact, if you think the LDA-based flow will be more efficient than the join-based factorization flow for large ranks, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least-squares and MAP losses.

I am assuming here that the LDA architecture is a bipartite graph with docs and words as nodes and counts on each edge. The solver will be run once on every node of each partition after it collects the ratings and factors from its edges.
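
To make the two constraint sets in the comment concrete, here is a minimal Python sketch of the feasibility steps only. It is not the Breeze PR's solver or anything from MLlib. The function names are hypothetical. The sort-based simplex projection is one standard way to enforce 1'w = 1, w >= 0, and clipping before column normalization is only one assumed reading of "0 <= h <= 1, normalize each column in h".

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}.

    Sort-based method: find the shift theta such that the positive
    parts of (v - theta) sum to one.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of the sorted values;
        # the last k that satisfies it gives the final shift.
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def clip_and_normalize_columns(h):
    """Clip every entry of h (a list of rows) to [0, 1], then rescale
    each column to sum to one. All-zero columns are left as zeros
    (an assumption made for this sketch)."""
    rows, cols = len(h), len(h[0])
    clipped = [[min(max(x, 0.0), 1.0) for x in row] for row in h]
    col_sums = [sum(clipped[i][j] for i in range(rows)) for j in range(cols)]
    return [
        [clipped[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
         for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    print(project_to_simplex([0.2, 0.3, 1.5]))
    print(clip_and_normalize_columns([[2.0, -1.0], [1.0, 0.5]]))
```

In an alternating scheme, each of these would be applied after (or inside) the corresponding subproblem solve, one row of w or one block of h at a time.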

was (Author: debasish83):
I am currently using the following PR to do large-rank matrix factorization with various constraints: https://github.com/scalanlp/breeze/pull/364

I am not sure whether the current ALS will scale to large ranks (we want to reach roughly the 10K range), so I am keen to compare the exact formulation against the GraphX-based LDA flow.

The idea is to solve the constrained factorization problem as explained in Vorontsov and Potapenko:

minimize f(w,h*) s.t. 1'w = 1, w >= 0 (row constraints)
minimize f(w*,h) s.t. 0 <= h <= 1, normalize each column in h

Here I want f(w,h) to be the MAP loss. I have already solved the least-squares variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got a good improvement in MAP statistics for recommendation workloads. I expect perplexity to improve here as well.

If no one else is looking into it, I would like to compare the join-based factorization flow (ml.recommendation.ALS) with the GraphX-based LDA flow. In fact, if you think the LDA-based flow will be more efficient than the join-based factorization flow for large ranks, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least-squares and MAP losses.

I am assuming here that the LDA architecture is a bipartite graph with docs and words as nodes and counts on each edge. The solver will be run once on every node of each partition after it collects the ratings and factors from its edges.

> Support sparse LDA solutions
> ----------------------------
>
>                 Key: SPARK-5564
>                 URL: https://issues.apache.org/jira/browse/SPARK-5564
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration parameters be > 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization." 2014.
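
As a rough illustration of what a projected M-step could look like, the Python sketch below shifts the expected topic-word counts by (beta - 1), truncates negatives to zero, and renormalizes each topic. This is one reading of the additive-regularization update for a Dirichlet prior, not the MLlib implementation. The function name, the counts layout (one row per topic) and the uniform fallback for an all-zero topic are assumptions made for this example.

```python
def m_step_topic_word(counts, beta):
    """Projected M-step for the topic-word matrix phi.

    counts[t][w] holds the expected count of word w in topic t from
    the E-step. Each row becomes proportional to
    max(counts[t][w] + beta - 1, 0). With beta < 1 the shift is
    negative, so rarely used words are truncated to exactly zero,
    which is where the sparsity comes from.
    """
    phi = []
    for row in counts:
        shifted = [max(n + beta - 1.0, 0.0) for n in row]
        total = sum(shifted)
        if total > 0:
            phi.append([x / total for x in shifted])
        else:
            # Assumed fallback: a topic with no surviving mass is
            # reset to the uniform distribution.
            phi.append([1.0 / len(row)] * len(row))
    return phi


if __name__ == "__main__":
    expected_counts = [[3.0, 0.5, 0.1], [0.2, 0.1, 0.3]]
    print(m_step_topic_word(expected_counts, beta=0.5))
    print(m_step_topic_word(expected_counts, beta=1.0))
```

With beta = 1.0 the shift vanishes and the update reduces to plain normalization of the counts. The same truncation applied to document-topic counts with alpha would give the corresponding update for theta.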