spark-user mailing list archives

From Duy Huynh <duy.huynh....@gmail.com>
Subject Re: word2vec: how to save an mllib model and reload it?
Date Fri, 07 Nov 2014 18:10:34 GMT
thanks nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into some compilation
issues now.  hopefully i can fix them soon, so i can get things going.
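
For reference, once the word vectors are accessible, the in-memory lookup described in the quoted reply below could look roughly like this. It is only a sketch in plain Java, assuming the vectors have already been exported into an ordinary map. `VectorServer`, `mostSimilar` and the toy vectors are made-up names for illustration, not MLlib API, and the brute-force scan stands in for the LSH-style pruning a real serving layer would need at scale.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical serving-side holder for word vectors; not part of MLlib.
class VectorServer {
    private final Map<String, float[]> vectors;

    VectorServer(Map<String, float[]> vectors) {
        this.vectors = vectors;
    }

    // Dot product of two vectors of equal length.
    static double dot(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Cosine similarity; defined as 0 when either vector has zero norm.
    static double cosine(float[] a, float[] b) {
        double na = Math.sqrt(dot(a, a));
        double nb = Math.sqrt(dot(b, b));
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
    }

    // Brute-force scan: scores every other word against the query word and
    // returns up to k words, most similar first. Unknown words give an empty list.
    List<String> mostSimilar(String query, int k) {
        List<String> top = new ArrayList<>();
        float[] q = vectors.get(query);
        if (q == null) {
            return top;
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (!e.getKey().equals(query)) {
                scored.add(new AbstractMap.SimpleEntry<String, Double>(
                        e.getKey(), cosine(q, e.getValue())));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            top.add(scored.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Toy two-dimensional vectors, invented for the example.
        Map<String, float[]> toy = new LinkedHashMap<>();
        toy.put("king", new float[] {1.0f, 0.0f});
        toy.put("queen", new float[] {0.9f, 0.1f});
        toy.put("apple", new float[] {0.0f, 1.0f});
        System.out.println(new VectorServer(toy).mostSimilar("king", 2));
    }
}
```

The same scan works for an ALS factor matrix if the map is keyed by item id instead of by word.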

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath <nick.pentreath@gmail.com>
wrote:

> For ALS if you want real time recs (and usually this is order 10s to a few
> 100s ms response), then Spark is not the way to go - a serving layer like
> Oryx, or prediction.io is what you want.
>
> (At graphflow we've built our own).
>
> You hold the factor matrices in memory and do the dot product in real time
> (with optional caching). Again, even for huge models (10s of millions
> users/items) this can be handled on a single, powerful instance. The issue
> at this scale is winnowing down the search space using LSH or similar
> approach to get to real time speeds.
>
> For word2vec it's pretty much the same thing, as what you have is very
> similar to one of the ALS factor matrices.
>
> One problem is you can't access the word2vec vectors, as they are a private
> val. I think this should be changed actually, so that just the word vectors
> could be saved and used in a serving layer.
>
>
> On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks <evan.sparks@gmail.com>
> wrote:
>
>> There are a few examples where this is the case. Let's take ALS, where
>> the result is a MatrixFactorizationModel, which is assumed to be big - the
>> model consists of two matrices, one (users x k) and one (k x products).
>> These are represented as RDDs.
>>
>> You can save these RDDs out to disk by doing something like
>>
>> model.userFeatures.saveAsObjectFile(...) and
>> model.productFeatures.saveAsObjectFile(...)
>>
>> to save out to HDFS or Tachyon or S3.
>>
>> Then, when you want to reload you'd have to instantiate them into a class
>> of MatrixFactorizationModel. That class is package private to MLlib right
>> now, so you'd need to copy the logic over to a new class, but that's the
>> basic idea.
>>
>> That said - using spark to serve these recommendations on a
>> point-by-point basis might not be optimal. There's some work going on in
>> the AMPLab to address this issue.
>>
>> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh <duy.huynh.uiv@gmail.com>
>> wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and saves
>>> at the master level.
>>>
>>> much appreciated.
>>>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@gmail.com>
>>> wrote:
>>>
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>> been merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running save(), same with matlab. In python either pickling things or
>>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>>> pickling -
>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>> all seem basically equivalent to java serialization to me.
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@gmail.com>
>>>> wrote:
>>>>
>>>>> that works.  is there a better way in spark?  this seems like the most
>>>>> common feature for any machine learning work - to be able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Plain old java serialization is one straightforward approach if
>>>>>> you're in java/scala.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained
>>>>>>> and reload
>>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>>> model...
>>>>>>> thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>
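
The "plain old java serialization" approach suggested in the thread can be sketched as follows. This is a minimal illustration, assuming the trained word vectors have been copied into an ordinary `HashMap`. `WordVectorStore` and its `save`/`load` methods are hypothetical names, not MLlib classes, and the file is a local temp file rather than HDFS, Tachyon or S3.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical container for word vectors; not part of MLlib.
class WordVectorStore implements Serializable {
    private static final long serialVersionUID = 1L;

    // HashMap and float[] are both Serializable, so no custom logic is needed.
    private final HashMap<String, float[]> vectors;

    WordVectorStore(Map<String, float[]> vectors) {
        this.vectors = new HashMap<>(vectors);
    }

    float[] get(String word) {
        return vectors.get(word);
    }

    // Writes the whole object graph to the given file.
    void save(File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back an object previously written by save().
    static WordVectorStore load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (WordVectorStore) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Toy vector, invented for the example.
        Map<String, float[]> toy = new HashMap<>();
        toy.put("spark", new float[] {0.1f, 0.2f, 0.3f});

        File file = File.createTempFile("word-vectors", ".ser");
        new WordVectorStore(toy).save(file);
        WordVectorStore reloaded = WordVectorStore.load(file);
        System.out.println(Arrays.toString(reloaded.get("spark")));
        file.delete();
    }
}
```

One caveat with this approach is that the saved file is tied to the class definition: changing the class without keeping `serialVersionUID` and the field layout compatible can make old files unreadable.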
