spot-dev mailing list archives

From "Edwards, Brandon" <brandon.edwa...@intel.com>
Subject Re: Spark EM LDA Optimizer support
Date Fri, 21 Jul 2017 19:05:02 GMT
Yes it makes sense that the results would differ if you ‘scored’ data on a distributed
model and then later tried to rescore with the saved model (that was necessarily converted
to local). 

My opinion is that we should only support use cases where you carry over a model that was
trained on one batch of data to be the starting point for the training on the next batch.
Let me know if anyone disagrees with that.

Based on the above opinion, my concern is this. If someone wanted to use the EM optimizer,
they could only use it on the first batch. From then on, they are loading saved local models
in which case you can only continue with Online optimization. This is an extra complication
for configuration. I mean, saying you wanted EM as your optimizer in the config would mean
only that the first run was done that way.
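
To make the configuration concern concrete, here is a minimal sketch of how the configured
optimizer could be reconciled with the presence of a saved model. It assumes, as described
above, that a loaded local model can only be continued with Online. The names `Optimizer`,
`OptimizerChoice`, and `effective` are hypothetical and not taken from the Spot code base.

```scala
// Hypothetical names throughout; none of these come from the Spot code base.
sealed trait Optimizer
case object EM extends Optimizer
case object Online extends Optimizer

object OptimizerChoice {
  /** Picks the optimizer for a run.
    * @param configured       optimizer requested in the configuration
    * @param savedModelExists true when a previously saved (local) model is being continued
    */
  def effective(configured: Optimizer, savedModelExists: Boolean): Optimizer =
    if (savedModelExists) Online // assumption: a loaded local model only continues with Online
    else configured              // first batch: honor the configured optimizer
}
```

Under this rule, configuring EM only affects the first run; every later run that loads a
saved model falls back to Online regardless of the setting.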

On the other hand, I guess it’s possible that starting with a large batch using EM would
perform better in the long run than starting with Online? We could look into this, but our
tests so far have shown Online to be up to par with EM if I recall correctly. 

On 7/21/17, 11:44 AM, "Barona, Ricardo" <ricardo.barona@intel.com> wrote:

    Once a saved model is loaded, it needs to be converted to LocalLDAModel if it’s a
    DistributedLDAModel. But from what I heard, what matters about the optimizer used for
    training, EM or Online, is the topics matrix that each one generates. I’m not exactly an
    expert, but I’d think they are going to be different, right? The topics matrix of a
    LocalLDAModel coming from a DistributedLDAModel will remain the same, and topic
    distributions will be calculated based on that.
    
    On 7/21/17, 1:26 PM, "Edwards, Brandon" <brandon.edwards@intel.com> wrote:
    
        A question just came up for me. Is there a true use case for utilizing EM that allows
        one to carry context from previous models into the future? It seems that once you save
        to a local model in order to utilize it for future data, from then on you can only use
        the Online optimizer. If this is correct, I vote for getting rid of EM. I don’t see
        value in supporting a use case that does not carry context into future models.
        
        On 7/21/17, 11:08 AM, "Barona, Ricardo" <ricardo.barona@intel.com> wrote:
        
            During the last 9 days, I've been modifying the Apache Spot LDA wrapper so that it
            can save models, load existing models, and then get topic distributions for the
            same corpus or for new documents (see https://issues.apache.org/jira/browse/SPOT-196).
            Until now, the Apache Spot ML module has run in batch mode, training and getting
            topic distributions on the same documents. That needs to change soon, as we are
            aiming for near real time.
            
            Since this year, Apache Spot has supported the Online optimizer, so users can
            select whether to run LDA using EM or Online. EM was the first option we
            implemented, and then we decided it was a good idea to offer Online as well.
            
            To keep supporting both the EM and Online optimizers, I modified the code in such
            a way that you can train with either one but only get topic distributions with
            LocalLDAModel, because only LocalLDAModel supports getting topic distributions for
            new documents. The problem with that approach is that a very simple unit test we
            have is now failing. When I convert DistributedLDAModel to LocalLDAModel, the
            document concentration parameter remains the same as originally provided for EM,
            and that value doesn't necessarily work for the LocalLDAModel.topicDistributions
            method.
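
A rough illustration of why the carried-over alpha matters, assuming the usual variational
form in which a document's weight for a topic is alpha plus the expected number of its
tokens assigned to that topic. This is a simplified sketch, not Spark's actual inference
code, and `AlphaEffect` and `topicDistribution` are hypothetical names. The values 1.002 and
0.0009 are the ones used as examples in the scaladoc further down in this message.

```scala
object AlphaEffect {
  /** Normalized per-topic weights for a single document.
    * Assumption: weight_k = alpha + (soft) token count assigned to topic k.
    * @param alpha       symmetric document concentration
    * @param topicCounts expected number of the document's tokens assigned to each topic
    */
  def topicDistribution(alpha: Double, topicCounts: Seq[Double]): Seq[Double] = {
    val gamma = topicCounts.map(_ + alpha) // add the prior to every topic
    val total = gamma.sum
    gamma.map(_ / total)                   // normalize so the weights sum to 1
  }
}

// One-word document whose single token is fully assigned to topic 0 (two topics).
val withEmAlpha     = AlphaEffect.topicDistribution(1.002, Seq(1.0, 0.0))
val withOnlineAlpha = AlphaEffect.topicDistribution(0.0009, Seq(1.0, 0.0))
```

With alpha 1.002 the prior dominates a one-word document, so its topic weight stays near
2/3, while with alpha 0.0009 it ends up very close to 1. That difference would be enough to
break a test that expects near one-hot distributions.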
            
            Take a look at https://issues.apache.org/jira/secure/attachment/12878382/everythingOK.png.
            There you can see the expected result from training and getting topic distributions
            with EM only or Online only, on a data set of two documents with one word each.
            
            Then, here is the problem I explained before about converting DistributedLDAModel
to LocalLDAModel: https://issues.apache.org/jira/secure/attachment/12878381/notSoOk.png
            
            A possible solution for this is to use the following code to implement a custom
function to convert DistributedLDAModel to LocalLDAModel (see https://issues.apache.org/jira/secure/attachment/12878380/possibleSolution.png
and the code below):
            
            package org.apache.spark.mllib.clustering

            import org.apache.spark.mllib.linalg.{Matrix, Vector}

            object SpotLDA {
              /**
                * Creates a new LocalLDAModel but it can reset alpha and beta (although we just
                * need alpha).
                * @param topicsMatrix Distributed LDA Model topicsMatrix
                * @param alpha New value for alpha i.e. If Model was trained with 1.002 for alpha
                *              using EM optimizer, this method allows you to reset alpha to
                *              something like 0.0009 and get topic distributions with the desired
                *              document concentration.
                * @param beta New value for beta
                * @return LocalLDAModel
                */
              def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
                new LocalLDAModel(topicsMatrix, alpha, beta)
              }
            }
            
            The only disadvantage I see here is that users will need to provide 3 parameters
            instead of only 2 if they are using the EM optimizer:

            - EM alpha
            - EM beta
            - Online alpha

            Or provide only 2 parameters if they prefer to work with the Online optimizer only:

            - Online alpha
            - Online beta
            
            Discussing this with Gustavo, he suggested we even set a “default” number
for Online alpha so if users only configure EM alpha and EM beta the application will keep
working.
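
The parameter scheme above, including the suggested default, could be sketched roughly as
follows. Everything here is hypothetical: the `LdaParams` fields, the `LdaParamResolver`
helper, and the string values for the optimizer are illustrative only, and the default of
0.0009 is just the example value from the scaladoc above, not a recommended setting.

```scala
// Hypothetical configuration model; names and the default value are illustrative only.
final case class LdaParams(
  optimizer: String,                  // "em" or "online"
  alpha: Double,                      // document concentration used for training
  beta: Double,                       // topic concentration used for training
  onlineAlpha: Option[Double] = None  // alpha used for scoring after EM training; optional
)

object LdaParamResolver {
  // Placeholder default so that configuring only EM alpha and EM beta keeps working.
  val DefaultOnlineAlpha: Double = 0.0009

  /** Alpha to use when getting topic distributions from the local model. */
  def scoringAlpha(p: LdaParams): Double = p.optimizer match {
    case "online" => p.alpha // Online: the training alpha is reused for scoring
    case "em"     => p.onlineAlpha.getOrElse(DefaultOnlineAlpha)
    case other    => throw new IllegalArgumentException(s"Unknown optimizer: $other")
  }
}
```

With this shape, Online users still configure two values, and EM users may supply a third
but are not forced to.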
            
            That being said, here is the big question I’d like to ask: should we keep
            supporting both the EM and Online optimizers and have users configure the required
            parameters, or do you think it is time to let EM go and keep only the Online optimizer?
            
            My vote is to keep both, but let me know what you think.
            
            Thanks,
            Ricardo Barona
            
        
        
    
    
