spark-user mailing list archives

From <jan.zi...@centrum.cz>
Subject Repartitioning by partition size, not by number of partitions.
Date Fri, 31 Oct 2014 10:26:51 GMT
Hi,

I have input data consisting of many very small files, each containing one .json.

For performance reasons (I use PySpark) I have to repartition; currently I do:

sc.textFile(files).coalesce(100)
 
The problem is that I have to guess the number of partitions so that the job is as fast as
possible while I still stay on the safe side with RAM. This is quite difficult.

For this reason I would like to ask whether there is some way to replace coalesce(100) with
something that creates N partitions of a given size. I went through the documentation, but
I was not able to find a way to do that.
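
To make the question concrete, below is a rough sketch of the sizing heuristic I end up
hand-rolling today (the target_bytes_per_partition value, the glob lookup and the local
os.path.getsize sizes are just illustrative assumptions; what I am hoping for is a built-in
way to get this behaviour, and on HDFS the sizes would have to come from the Hadoop
FileSystem API instead):

import glob
import os

# Illustrative per-partition budget; not a Spark default, just an assumption.
target_bytes_per_partition = 64 * 1024 * 1024

# Assuming `files` is a local glob pattern; sum up the total input size.
paths = glob.glob(files)
total_bytes = sum(os.path.getsize(p) for p in paths)

# Derive the partition count from the total size instead of hard-coding 100.
num_partitions = max(1, int(total_bytes // target_bytes_per_partition))

rdd = sc.textFile(files).coalesce(num_partitions)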

Thank you in advance for any help or advice.
 

