spark-dev mailing list archives

From "Ulanov, Alexander" <>
Subject Number of partitions for binaryFiles
Date Tue, 26 Apr 2016 18:10:38 GMT
Dear Spark developers,

I have 100 binary files in the local file system that I want to load into a Spark RDD. I need the
data from each file to be in a separate partition. However, I cannot make that happen:

scala> sc.binaryFiles("/data/subset").partitions.size
res5: Int = 66

The "minPartitions" parameter does not seem to help:
scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
res8: Int = 66

At the same time, Spark produces the required number of partitions with sc.textFile (though
I cannot use it because my files are binary):
scala> sc.textFile("/data/subset").partitions.size
res9: Int = 100
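One workaround I have considered (a sketch only; I have not verified it) is to repartition after loading. This should yield 100 partitions, but it incurs a shuffle, and since binaryFiles returns (path, content) records, there is no guarantee of exactly one file per partition:

```scala
// Possible workaround (unverified assumption): force 100 partitions via a shuffle.
// Each record is a (filename, PortableDataStream) pair, so with 100 files and
// 100 partitions this gives roughly, but not necessarily exactly, one file each.
scala> sc.binaryFiles("/data/subset").repartition(100).partitions.size
```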

Could you suggest how to force Spark to load binary files each in a separate partition?

Best regards, Alexander
