Yup - it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed. If you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large.

On Fri, 13 Jan 2017 at 19:42, Sean Owen <email@example.com> wrote:

You're referring to code that serializes models, which are quite small. For example, a PCA model consists of a few principal component vectors. It's a Dataset of just one element being saved here. It's re-using the code path normally used to save big data sets, to output one file with one thing as Parquet.

On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <firstname.lastname@example.org> wrote:

But why is that beneficial? The data is supposedly quite large; distributing it across many partitions/files would seem to make sense.

On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <email@example.com> wrote:

That is usually so the result comes out in one file, not partitioned over n files.

On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <firstname.lastname@example.org> wrote:

Hi,

I'm curious why it's common for data to be repartitioned to 1 partition when saving ML models:
sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)

This shows up in most ML models I've seen (Word2Vec, PCA, LDA). Am I missing some benefit of repartitioning like this?

Thanks,
--
Asher Krim
Senior Software Engineer
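
For illustration, a minimal, self-contained sketch of the save pattern the thread is discussing, assuming Spark 2.x; the Data case class, the values, and the output path are hypothetical, not the actual Spark ML internals:

    import org.apache.spark.sql.SparkSession

    object SmallModelSaveSketch {
      // Hypothetical container for a "small" model's coefficients.
      case class Data(coefficients: Seq[Double], intercept: Double)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SmallModelSaveSketch")
          .master("local[*]")
          .getOrCreate()

        // A single row holding all of the model's data.
        val data = Data(Seq(0.1, 0.2, 0.3), 0.5)

        // repartition(1) guarantees the Parquet output is exactly one
        // file containing the one row, regardless of the session's
        // default parallelism.
        spark.createDataFrame(Seq(data))
          .repartition(1)
          .write
          .parquet("/tmp/small-model/data")

        spark.stop()
      }
    }

Without the repartition, a tiny DataFrame could (depending on Spark version) be split across the session's default parallelism, leaving many empty part files next to the one real one.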
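For contrast, a sketch of the ALS-style case mentioned at the top of the thread, where the data being saved can be genuinely large; the column names, row count, and path here are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object LargeFactorSaveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LargeFactorSaveSketch")
          .master("local[*]")
          .getOrCreate()

        // Stand-in for ALS user factors: potentially one row per user.
        val userFactors = spark.range(1000000)
          .selectExpr("id", "array(rand(), rand(), rand()) as features")

        // No repartition(1) here: collapsing a large, distributed
        // DataFrame into one partition would funnel every row through a
        // single task, so the existing partitioning is kept.
        userFactors.write.parquet("/tmp/als-model/userFactors")

        spark.stop()
      }
    }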