I guess it depends on the definition of "small". A Word2Vec model with vectorSize=300 and a vocabulary of 3 million words takes nearly 4 GB. While it does fit on a single machine (so it isn't really "big" data), I don't see the benefit of storing the model in one file. On the contrary, it seems that we would want the model to be distributed:
* avoids shuffling of data to one executor
* allows the whole cluster to participate in saving the model
* avoids rpc issues (http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving)
* "feature parity" with mllib (issues with one large model file already solved for mllib in SPARK-11994)
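
For a rough sense of scale, here is a back-of-the-envelope sketch in plain Python (no Spark needed). It estimates the size of the Word2Vec model described above and how many partitions it would take to keep each one under a target size. The constants are my assumptions, not values read from the Spark source: 32-bit floats, an average of 15 bytes per vocabulary word, and a hypothetical 64 MiB target per partition. The function names are made up for illustration.

```python
# Rough sizing sketch. All constants below are assumptions for illustration.
FLOAT_BYTES = 4                        # assumption: vectors stored as 32-bit floats
AVG_WORD_BYTES = 15                    # assumption: average size of a vocabulary string
TARGET_PART_BYTES = 64 * 1024 * 1024   # assumption: aim for at most 64 MiB per partition


def approx_model_bytes(vector_size, num_words):
    """Approximate model size: one float vector plus one word string per entry."""
    return (FLOAT_BYTES * vector_size + AVG_WORD_BYTES) * num_words


def num_partitions(model_bytes, target_bytes=TARGET_PART_BYTES):
    """Smallest partition count that keeps every partition under target_bytes."""
    return model_bytes // target_bytes + 1


size = approx_model_bytes(vector_size=300, num_words=3_000_000)
print("approx bytes:", size)
print("partitions:", num_partitions(size))
```

Under these assumptions the model comes to roughly 3.6 GB, which lines up with the "nearly 4 GB" figure, and it would need a few dozen partitions rather than one.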

On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
Yup, it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed.

If you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large.

On Fri, 13 Jan 2017 at 19:42, Sean Owen <sowen@cloudera.com> wrote:
You're referring to code that serializes models, which are quite small. For example, a PCA model consists of a few principal component vectors. It's a Dataset of just one element being saved here. It's reusing the code path normally used to save big data sets, to output 1 file with 1 thing as Parquet.

On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <akrim@hubspot.com> wrote:
But why is that beneficial? The data is supposedly quite large, so distributing it across many partitions/files would seem to make sense.

On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <sowen@cloudera.com> wrote:
That is usually so the result comes out in one file, not partitioned over n files.

On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <akrim@hubspot.com> wrote:

I'm curious why it's common for data to be repartitioned to 1 partition when saving ml models:


This shows up in most ml models I've seen (Word2Vec, PCA, LDA). Am I missing some benefit of repartitioning like this?

Asher Krim
Senior Software Engineer
