spark-dev mailing list archives

From Asher Krim <ak...@hubspot.com>
Subject Re: Why are ml models repartition(1)'d in save methods?
Date Mon, 16 Jan 2017 16:41:00 GMT
Cool, thanks!

Jira: https://issues.apache.org/jira/browse/SPARK-19247
PR: https://github.com/apache/spark/pull/16607

I think the LDA model has the same issue: currently the
`topicsMatrix` (which is on the order of numWords*k, 4GB for numWords=3m and
k=1000) is saved as a single element in a case class. We should probably
address this in a separate issue.
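
A rough sketch of the arithmetic behind the Word2Vec figures quoted below
(illustrative only, not the code in the PR). It assumes each vector component
is a 4-byte Float; the object and function names are hypothetical, and the
64 MiB per-partition target is just an example value:

```scala
object ModelSizeSketch {
  // Raw size of the word vectors: one 4-byte Float per dimension per word (assumption).
  def vectorBytes(numWords: Long, vectorSize: Long): Long =
    numWords * vectorSize * 4L

  // Smallest partition count that keeps each partition at or below maxBytes
  // (ceiling division, never fewer than one partition).
  def partitionsNeeded(totalBytes: Long, maxBytes: Long): Int = {
    require(maxBytes > 0, "maxBytes must be positive")
    math.max(1L, (totalBytes + maxBytes - 1) / maxBytes).toInt
  }

  def main(args: Array[String]): Unit = {
    // vocabulary=3m, vectorSize=300, as in the quoted message
    val total = vectorBytes(numWords = 3000000L, vectorSize = 300L)
    println(s"raw vector bytes: $total")
    println(s"partitions at 64 MiB each: ${partitionsNeeded(total, 64L * 1024 * 1024)}")
    println(s"partitions at 2 GB each: ${partitionsNeeded(total, Int.MaxValue.toLong)}")
  }
}
```

With these inputs the raw vectors come to 3.6 billion bytes, which is already
past the 2GB limit for a single datum, so a single-partition save cannot hold
the model as one element.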

On Fri, Jan 13, 2017 at 3:55 PM, Sean Owen <sowen@cloudera.com> wrote:

> Yes, certainly debatable for word2vec. You have a good point that this
> could overrun the 2GB limit if the model is one big datum, for large but
> not crazy models. This model could probably easily be serialized as
> individual vectors in this case. It would introduce a
> backwards-compatibility issue but it's possible to read old and new
> formats, I believe.
>
> On Fri, Jan 13, 2017 at 8:16 PM Asher Krim <akrim@hubspot.com> wrote:
>
>> I guess it depends on the definition of "small". A Word2vec model with
>> vectorSize=300 and vocabulary=3m takes nearly 4gb. While it does fit on a
>> single machine (so isn't really "big" data), I don't see the benefit in
>> having the model stored in one file. On the contrary, it seems that we
>> would want the model to be distributed:
>> * avoids shuffling of data to one executor
>> * allows the whole cluster to participate in saving the model
>> * avoids rpc issues (http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving)
>> * "feature parity" with mllib (issues with one large model file already
>> solved for mllib in SPARK-11994
>> <https://issues.apache.org/jira/browse/SPARK-11994>)
>>
>>
>> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>
>> Yup - it's because almost all model data in Spark ML (model coefficients)
>> is "small", i.e. non-distributed.
>>
>> If you look at ALS you'll see there is no repartitioning, since the factor
>> DataFrames can be large.
>>
>> On Fri, 13 Jan 2017 at 19:42, Sean Owen <sowen@cloudera.com> wrote:
>>
>> You're referring to code that serializes models, which are quite small.
>> For example, a PCA model consists of a few principal component vectors. It's
>> a Dataset of just one element being saved here. It's re-using the code path
>> normally used to save big data sets, to output 1 file with 1 thing as
>> Parquet.
>>
>> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <akrim@hubspot.com> wrote:
>>
>> But why is that beneficial? The data is supposedly quite large, so
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> That is usually so the result comes out in one file, not partitioned over
>> n files.
>>
>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <akrim@hubspot.com> wrote:
>>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>
>> This shows up in most ml models I've seen (Word2Vec
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>> PCA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>> LDA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
