spark-user mailing list archives

From "Nick Pentreath" <nick.pentre...@gmail.com>
Subject Re: word2vec: how to save an mllib model and reload it?
Date Fri, 07 Nov 2014 17:20:47 GMT
Currently I see the word2vec model is collected onto the master, so the model itself is not
distributed. 


I guess the question is: why do you need a distributed model? Is the vocab size so large
that it's necessary? For model serving in general, unless the model is truly massive (i.e. it
cannot fit into memory on a modern high-end box with 64 or 128 GB of RAM), a single instance is
much faster and simpler (using a cluster of machines is more for load balancing / fault tolerance).
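
For reference, the single-instance route with plain Java serialization (which the quoted messages report works for the word2vec model) might look roughly like the sketch below. It is only an illustration: `ToyWordVectors` is a hypothetical stand-in for a trained model held in one JVM, not an MLlib class, and `save` / `load` are made-up helper names, not an existing MLlib API.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

public class ModelPersistence {

    // Hypothetical stand-in for a trained model living in a single JVM.
    // Any class that implements Serializable can be handled the same way.
    public static class ToyWordVectors implements Serializable {
        private static final long serialVersionUID = 1L;
        public final HashMap<String, float[]> vectors = new HashMap<>();
    }

    // Write the model to a file using standard Java serialization.
    public static void save(Serializable model, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(model);
        }
    }

    // Read a previously saved model back; the caller picks the expected type.
    @SuppressWarnings("unchecked")
    public static <T> T load(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyWordVectors model = new ToyWordVectors();
        model.vectors.put("spark", new float[] {0.1f, 0.2f, 0.3f});

        // Round trip: save to a temporary file, then reload it.
        Path path = Files.createTempFile("word-vectors", ".bin");
        save(model, path);
        ToyWordVectors reloaded = load(path);

        System.out.println("dimensions for 'spark': " + reloaded.vectors.get("spark").length);
        Files.delete(path);
    }
}
```

With Spark on the classpath, the same two helpers could presumably be called with the trained model object in place of the stand-in, as long as the model class is Serializable. The saved file is tied to the class definition that wrote it, so this suits reloading with the same library version rather than long-term interchange.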




What is your use case for model serving?



On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@gmail.com> wrote:

> you're right, serialization works.
> what is your suggestion on saving a "distributed" model?  so part of the
> model is in one cluster, and some other parts of the model are in other
> clusters.  during runtime, these sub-models run independently in their own
> clusters (load, train, save).  and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
> much appreciated.
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@gmail.com>
> wrote:
>> There's some work going on to support PMML -
>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>> merged into master.
>>
>> What are you used to doing in other environments? In R I'm used to running
>> save(), same with matlab. In python either pickling things or dumping to
>> json seems pretty common. (even the scikit-learn docs recommend pickling -
>> http://scikit-learn.org/stable/modules/model_persistence.html). These all
>> seem basically equivalent to Java serialization to me.
>>
>> Would some helper functions (in, say, mllib.util.modelpersistence or
>> something) make sense to add?
>>
>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@gmail.com>
>> wrote:
>>
>>> that works.  is there a better way in spark?  this seems like the most
>>> common feature for any machine learning work - to be able to save your
>>> model after training it and load it later.
>>>
>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@gmail.com>
>>> wrote:
>>>
>>>> Plain old java serialization is one straightforward approach if you're
>>>> in java/scala.
>>>>
>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@gmail.com> wrote:
>>>>
>>>>> what is the best way to save an mllib model that you just trained and
>>>>> reload
>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>> thanks.