spark-dev mailing list archives

From Asher Krim <ak...@hubspot.com>
Subject Why are ml models repartition(1)'d in save methods?
Date Fri, 13 Jan 2017 17:23:04 GMT
Hi,

I'm curious why data is so commonly repartitioned to a single partition
when saving ML models:

sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)

This shows up in most ML models I've seen (Word2Vec
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
PCA
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
LDA
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
Am I missing some benefit of repartitioning like this?
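
For reference, here is a minimal sketch of what the pattern does in practice.
It assumes Spark 2.x with a local SparkSession (the line above uses
sqlContext); the Data case class, the temp directory, and the object name are
hypothetical and only for illustration. It doesn't explain why the models do
this; it only shows the observable effect, which is one partition and so one
part file on disk:

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for a model's saved payload; not a real Spark ML class.
case class Data(word: String, vector: Array[Float])

object RepartitionOneSketch {
  // Returns (number of partitions before the write, number of parquet part files written).
  def run(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[4]") // assumption: local mode with 4 cores
      .appName("repartition-one-sketch")
      .getOrCreate()
    try {
      // Hypothetical output location under a fresh temp directory.
      val dataPath = Files.createTempDirectory("model").resolve("data").toString
      val data = Data("spark", Array(0.1f, 0.2f))

      // Same shape as the save methods: a one-row DataFrame forced into one partition.
      val df = spark.createDataFrame(Seq(data)).repartition(1)
      val numPartitions = df.rdd.getNumPartitions
      df.write.parquet(dataPath)

      // Count data files only; skip _SUCCESS and the hidden .crc checksum files.
      val partFiles = new File(dataPath).listFiles()
        .map(_.getName)
        .count(n => n.startsWith("part-") && n.endsWith(".parquet"))

      (numPartitions, partFiles)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (numPartitions, partFiles) = run()
    println(s"partitions=$numPartitions, part files=$partFiles")
  }
}
```

Dropping the repartition(1) call and comparing the two values from run() would
show whether it changes anything for a one-row DataFrame on a given Spark
version.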

Thanks,
-- 
Asher Krim
Senior Software Engineer
