spark-user mailing list archives

From Adrien Mogenet <adrien.moge...@contentsquare.com>
Subject Split content into multiple Parquet files
Date Tue, 08 Sep 2015 06:34:41 GMT
Hi there,

We've spent several hours trying to split our input data into several Parquet
files (or several folders, i.e.
/datasink/output-parquets/<key>/foobar.parquet) based on a low-cardinality
key. This works very well when using saveAsHadoopFile, but we can't
achieve anything similar with Parquet files.
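
For context, the saveAsHadoopFile version that works for us looks roughly
like this (class, path, and type names are illustrative; it uses the old
mapred MultipleTextOutputFormat to route each record into a per-key
sub-directory, but it only produces text output, not Parquet):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each (key, value) pair into a sub-directory named after its key.
    class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Write only the value; drop the key from the output records.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()

      // e.g. /datasink/output-text/<key>/part-00000
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        s"${key.toString}/$name"
    }

    // rdd: RDD[(String, String)] keyed by the low-cardinality key
    rdd.saveAsHadoopFile(
      "/datasink/output-text",
      classOf[String],
      classOf[String],
      classOf[KeyBasedOutputFormat])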

The only working solution we've found so far is to persist the RDD and then
loop over it N times to write N files (see the sketch below). That does not
seem acceptable...
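
To make that concrete, the loop looks roughly like this (Record, rdd and
sqlContext are illustrative names; note that each distinct key triggers a
full filter pass over the persisted RDD plus a separate Parquet write):

    import org.apache.spark.sql.SQLContext

    case class Record(key: String, payload: String)  // illustrative schema

    // rdd: RDD[Record]; sqlContext: SQLContext -- both assumed to exist
    rdd.persist()

    val keys = rdd.map(_.key).distinct().collect()

    // One filtered pass, and one Parquet write, per distinct key value.
    keys.foreach { k =>
      sqlContext.createDataFrame(rdd.filter(_.key == k))
        .write
        .parquet(s"/datasink/output-parquets/$k")
    }

    rdd.unpersist()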

Do you guys have any suggestions for this kind of operation?

-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
