spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Chan <simonc...@gmail.com>
Subject Re: word2vec: how to save an mllib model and reload it?
Date Fri, 07 Nov 2014 18:59:01 GMT
Just want to elaborate more on Duy's suggestion on using PredictionIO.

PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:

 def train(data: PreparedData): PersistentMatrixFactorizationModel = {
    val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
    new PersistentMatrixFactorizationModel(
      rank = m.rank,
      userFeatures = m.userFeatures,
      productFeatures = m.productFeatures)
  }


And the persisted model will be passed to the predict function when you
query for prediction:

def predict(
    model: PersistentMatrixFactorizationModel,
    query: Query): PredictedResult = {
    val productScores = model.recommendProducts(query.user, query.num)
      .map (r => ProductScore(r.product, r.rating))
    new PredictedResult(productScores)}



Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon


On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath <nick.pentreath@gmail.com>
wrote:

> Sure - in theory this sounds great. But in practice it's much faster and a
> whole lot simpler to just serve the model from single instance in memory.
> Optionally you can multithread within that (as Oryx 1 does).
>
> There are very few real world use cases where the model is so large that
> it HAS to be distributed.
>
> Having said this, it's certainly possible to distribute model serving for
> factor-like models (like ALS). One idea I'm working on now is using
> Elasticsearch for exactly this purpose - but that more because I'm using it
> for filtering of recommendation results and combining with search, so
> overall it's faster to do it this way.
>
> For the pure matrix algebra part, single instance in memory is way faster.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh <duy.huynh.uiv@gmail.com> wrote:
>
>> hi nick.. sorry about the confusion.  originally i had a question
>> specifically about word2vec, but my follow up question on distributed model
>> is a more general question about saving different types of models.
>>
>> on distributed model, i was hoping to implement a model parallelism, so
>> that different workers can work on different parts of the models, and then
>> merge the results at the end at the single master model.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> Currently I see the word2vec model is collected onto the master, so the
>>> model itself is not distributed.
>>>
>>> I guess the question is why do you need  a distributed model? Is the
>>> vocab size so large that it's necessary? For model serving in general,
>>> unless the model is truly massive (ie cannot fit into memory on a modern
>>> high end box with 64, or 128GB ram) then single instance is way faster and
>>> simpler (using a cluster of machines is more for load balancing / fault
>>> tolerance).
>>>
>>> What is your use case for model serving?
>>>
>>> —
>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@gmail.com>
>>> wrote:
>>>
>>>> you're right, serialization works.
>>>>
>>>> what is your suggestion on saving a "distributed" model?  so part of
>>>> the model is in one cluster, and some other parts of the model are in other
>>>> clusters.  during runtime, these sub-models run independently in their own
>>>> clusters (load, train, save).  and at some point during run time these
>>>> sub-models merge into the master model, which also loads, trains, and saves
>>>> at the master level.
>>>>
>>>> much appreciated.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@gmail.com>
>>>> wrote:
>>>>
>>>>> There's some work going on to support PMML -
>>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>>> been merged into master.
>>>>>
>>>>> What are you used to doing in other environments? In R I'm used to
>>>>> running save(), same with matlab. In python either pickling things or
>>>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>>>> pickling -
>>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>>> all seem basically equivalent java serialization to me..
>>>>>
>>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>>> something) make sense to add?
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> that works.  is there a better way in spark?  this seems like the
>>>>>> most common feature for any machine learning work - to be able to
save your
>>>>>> model after training it and load it later.
>>>>>>
>>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Plain old java serialization is one straightforward approach
if
>>>>>>> you're in java/scala.
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@gmail.com>
wrote:
>>>>>>>
>>>>>>>> what is the best way to save an mllib model that you just
trained
>>>>>>>> and reload
>>>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>>>> model...
>>>>>>>> thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive
at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message