spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-5571) LDA should handle text as well
Date Tue, 03 Feb 2015 20:23:36 GMT
Joseph K. Bradley created SPARK-5571:
----------------------------------------

             Summary: LDA should handle text as well
                 Key: SPARK-5571
                 URL: https://issues.apache.org/jira/browse/SPARK-5571
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.3.0
            Reporter: Joseph K. Bradley


Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts.  It should
also support training and prediction using text (Strings).

This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].

There should be:
* a runWithText() method that takes an RDD in which each element is a collection of Strings
(a bag of words).  This method will also index terms and compute a dictionary.
* a dictionary parameter for when LDA is run with word count vectors
* prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is
currently commented out in LDA)
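
To illustrate the three items above, here is a minimal sketch in plain Python of the indexing step that a text-based entry point would need. It is not the MLlib API and does not use Spark: the helper names build_dictionary, to_count_vector and describe_topic_as_strings are hypothetical, and in-memory lists stand in for an RDD. It only shows how bags of words could be mapped to word count vectors through a dictionary, and how term indices in a topic could be mapped back to Strings.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index (hypothetical helper).

    Terms are sorted so that the index assignment is deterministic.
    """
    vocab = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(vocab)}


def to_count_vector(doc, dictionary):
    """Convert one bag of words into a dense word count vector.

    Terms missing from the dictionary are ignored.
    """
    counts = [0] * len(dictionary)
    for term, n in Counter(doc).items():
        index = dictionary.get(term)
        if index is not None:
            counts[index] += n
    return counts


def describe_topic_as_strings(topic, dictionary):
    """Map (term index, weight) pairs back to (term, weight) pairs."""
    index_to_term = {index: term for term, index in dictionary.items()}
    return [(index_to_term[index], weight) for index, weight in topic]


# Two tiny documents, each already tokenized into a bag of words.
docs = [["spark", "lda", "topic", "spark"], ["text", "topic", "model"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(doc, dictionary) for doc in docs]
```

Under these assumptions, the resulting vectors are what the existing vector-based LDA would consume, and the same dictionary could be passed along so that topics are reported as terms rather than indices.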




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


