spot-dev mailing list archives

From "Edwards, Brandon" <>
Subject Re: Spark EM LDA Optimizer support
Date Thu, 27 Jul 2017 15:59:42 GMT
Could there be some reason why establishing the model using EM, saving that initial model, and
then performing further training with Online is a good thing to do?

Gustavo, I have a question: how do you see what Ricardo implemented as a solution to the fact
that spark-ml really uses Online under the hood when retraining with EM? We still need to switch
from EM to Online in order to score on unseen data, right? I still don’t see any way other than
training with EM on the first batch only.
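
For reference, the EM-then-score flow in question can be sketched against spark.mllib roughly as follows. This is only a sketch, not Spot's wrapper: the object name, the toy two-document corpus, k = 2, the local master and the 1.002/1.001 concentrations are all assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LocalLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object EmThenScore {
  // Train with EM on a first batch, then convert to a LocalLDAModel,
  // the only model type that can score documents it was not trained on.
  def trainAndConvert(firstBatch: RDD[(Long, Vector)], k: Int): LocalLDAModel = {
    val em = new LDA()
      .setK(k)
      .setOptimizer("em")
      .setDocConcentration(1.002)   // assumed value; EM expects values > 1.0
      .setTopicConcentration(1.001) // assumed value; EM expects values > 1.0
      .run(firstBatch)
      .asInstanceOf[DistributedLDAModel]
    em.toLocal // keeps the topics matrix and the EM concentrations as they are
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("em-then-score")
    val sc = new SparkContext(conf)
    try {
      // Two documents, one word each, as (document id, term-count vector).
      val firstBatch = sc.parallelize(Seq(
        (0L, Vectors.dense(1.0, 0.0)),
        (1L, Vectors.dense(0.0, 1.0))))
      val local = trainAndConvert(firstBatch, k = 2)

      // Score a document that was not part of training.
      val unseen = sc.parallelize(Seq((2L, Vectors.dense(1.0, 0.0))))
      local.topicDistributions(unseen).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that toLocal carries the EM document concentration over unchanged, which is the behaviour Ricardo reports as not necessarily working for topicDistributions.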

On 7/27/17, 7:42 AM, "Barona, Ricardo" <> wrote:

    I was wondering something similar the other day. What made the Apache Spark team offer
    an option to convert a model resulting from EM into a LocalLDAModel (Online)? I’m talking about
    On 7/27/17, 9:02 AM, "Giacomo Bernardi" <> wrote:
        That's a very interesting development!
        However, let me take a step back: why do we even need EM? From a user
        perspective, what would be the advantage of running anomaly detection on
        1-day batches rather than on a continuously online-learning model? I'm
        probably missing something because I don't see value for the latter use
        On 21 July 2017 at 20:06, Lujan Moreno, Gustavo <> wrote:
        > I would suggest supporting both for now. In my experiments online is
        > taking more iterations to converge (although I haven’t measured time,
        > online is supposed to be faster). spark.mllib doesn’t allow scoring
        > unseen records with EM, only training. The new does allow training
        > with EM and scoring unseen documents with EM, but Ricardo and I found that it
        > is really using online under the hood. I consider that to be a bug on
        > Spark’s side. Therefore, what Ricardo is suggesting is a workaround for this
        > bug.
        > On 7/21/17, 1:44 PM, "Barona, Ricardo" <> wrote:
        > >Once a saved model is loaded, it needs to be converted to LocalLDAModel if
        > it’s a DistributedLDAModel. From what I heard, though, the importance of which
        > optimizer you used for training, EM or Online, lies in the topics matrix that
        > each one generates. I’m not exactly an expert, but I’d think they are going
        > to be different, right? The topics matrix of a LocalLDAModel coming from a
        > DistributedLDAModel will remain the same, and topic distributions will be
        > calculated based on that.
        > >
        > >On 7/21/17, 1:26 PM, "Edwards, Brandon" <>
        > wrote:
        > >
        > >    A question just came up for me. Is there a true use case for
        > utilizing EM that allows one to carry context from previous models into the
        > future? It seems that once you save to a local model in order to utilize it
        > for future data, from then on you can only use the Online optimizer. If
        > this is correct, I vote for getting rid of EM. I don’t see value in
        > supporting a use case that does not carry context into future models.
        > >
        > >    On 7/21/17, 11:08 AM, "Barona, Ricardo" <>
        > wrote:
        > >
        > >        During the last 9 days, I've been working on modifying the Apache
        > Spot LDA wrapper to make it possible to save models, load
        > existing models, and then get topic distributions for the same corpus or for
        > new documents (see Until
        > now, the Apache Spot ML module has been running in batch mode, training and
        > getting topic distributions with the same documents it trained on, but that
        > needs to change soon as we are looking forward to achieving near real time.
        > >
        > >        This year, Apache Spot enabled the Online optimizer so users
        > can select whether to run LDA using EM or Online; EM was the first option
        > we implemented and then we decided it was a good idea to offer Online as
        > well.
        > >
        > >        In my attempt to keep supporting both the EM and Online
        > optimizers, I modified the code in such a way that you can train with either
        > one but only get topic distributions with LocalLDAModel. The reason for
        > that is that only LocalLDAModel supports getting topic distributions for
        > new documents. The problem with that approach is that a very simple unit
        > test we have is now failing, and it is because when I convert
        > DistributedLDAModel to LocalLDAModel, the document concentration parameter
        > remains the same as it was originally provided for EM, but that value doesn't
        > necessarily work for the LocalLDAModel.topicDistributions method.
        > >
        > >        Take a look at
        > 12878382/everythingOK.png. There you can see the expected result from
        > training and getting topic distributions with EM only or Online only on a
        > data set of two documents with one word each.
        > >
        > >        Then, here is the problem I explained before about converting
        > DistributedLDAModel to LocalLDAModel:
        > jira/secure/attachment/12878381/notSoOk.png
        > >
        > >        A possible solution for this is to use the following code to
        > implement a custom function to convert DistributedLDAModel to LocalLDAModel
        > (see
        > 12878380/possibleSolution.png and the code below):
        > >
        > >        package org.apache.spark.mllib.clustering
        > >
        > >        import org.apache.spark.mllib.linalg.{Matrix, Vector}
        > >
        > >        object SpotLDA {
        > >          /**
        > >            * Creates a new LocalLDAModel but it can reset alpha and beta
        > >            * (although we just need alpha).
        > >            * @param topicsMatrix Distributed LDA Model topicsMatrix
        > >            * @param alpha New value for alpha i.e. If Model was trained
        > >            *              with 1.002 for alpha using EM optimizer, this method
        > >            *              allows you to reset alpha to something like
        > >            *              0.0009 and get topic distributions with the desired
        > >            *              document concentration.
        > >            * @param beta New value for beta
        > >            * @return LocalLDAModel
        > >            */
        > >          def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
        > >            new LocalLDAModel(topicsMatrix, alpha, beta)
        > >          }
        > >        }
        > >
        > >        The only disadvantage I see here is that users will need to
        > provide 3 parameters if they are using the EM optimizer instead of only 2:
        > >
        > >        -          EM alpha
        > >
        > >        -          EM beta
        > >
        > >        -          Online alpha
        > >        Or provide only 2 parameters if they prefer to work with Online
        > Optimizer only
        > >
        > >        -          Online alpha
        > >
        > >        -          Online beta
        > >
        > >        Discussing this with Gustavo, he suggested we even set a
        > “default” number for Online alpha so if users only configure EM alpha and
        > EM beta the application will keep working.
        > >
        > >        That being said, here is the big question I’d like to ask:
        > should we keep supporting both the EM optimizer and the Online optimizer and have
        > users configure the required parameters, or do you think it is time to let
        > EM go and just keep the Online optimizer?
        > >
        > >        My vote is to keep both, but let me know what you think.
        > >
        > >        Thanks,
        > >        Ricardo Barona
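
The parameter scheme discussed in the quoted message (EM alpha, EM beta and Online alpha, with a default for Online alpha when it is not configured) could be sketched in plain Scala along these lines. All names here are hypothetical and not part of Spot or Spark, and the 0.0009 default is only a placeholder borrowed from the example value in the toLocal comment; the thread does not settle on a number.

```scala
// Hypothetical configuration types; none of these exist in Apache Spot or Spark.
final case class EmLdaParams(emAlpha: Double, emBeta: Double, onlineAlpha: Double)
final case class OnlineLdaParams(onlineAlpha: Double, onlineBeta: Double)

object SpotLdaConfig {
  // Placeholder default, borrowed from the example value in the toLocal comment.
  val DefaultOnlineAlpha: Double = 0.0009

  // EM users configure EM alpha and EM beta; Online alpha falls back to the
  // default so that configurations with only those two values keep working.
  def forEm(emAlpha: Double,
            emBeta: Double,
            onlineAlpha: Option[Double] = None): EmLdaParams =
    EmLdaParams(emAlpha, emBeta, onlineAlpha.getOrElse(DefaultOnlineAlpha))

  // Online-only users provide just the two Online parameters.
  def forOnline(onlineAlpha: Double, onlineBeta: Double): OnlineLdaParams =
    OnlineLdaParams(onlineAlpha, onlineBeta)
}
```

With this shape, SpotLdaConfig.forEm(1.002, 1.001) would pick up the default Online alpha, while SpotLdaConfig.forEm(1.002, 1.001, Some(0.05)) would override it.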
