spark-user mailing list archives

From "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com>
Subject sparkcontext.objectFile return thousands of partitions
Date Wed, 21 Jan 2015 15:31:10 GMT
Why does sc.objectFile(...) return an RDD with thousands of partitions?

I save an RDD to the file system using

rdd.saveAsObjectFile("file:///tmp/mydir")

Note that the RDD contains 7 million objects. I checked the directory /tmp/mydir/; it contains
8 partition files:

part-00000  part-00002  part-00004  part-00006  _SUCCESS
part-00001  part-00003  part-00005  part-00007
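For context, a minimal sketch of the save side (the SparkContext setup and the generated data
are placeholders for illustration, not the actual job): saveAsObjectFile writes one part file
per partition, which is why 8 partitions produce the 8 files above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val conf = new SparkConf().setAppName("save-object-file").setMaster("local[*]")
val sc = new SparkContext(conf)

// Hypothetical data: 7 million LabeledPoints, explicitly placed in 8 partitions.
val rdd = sc.parallelize(1 to 7000000, 8)
  .map(i => LabeledPoint(i % 2, Vectors.dense(i.toDouble)))

// One part-NNNNN file is written per partition, so 8 partitions
// yield part-00000 through part-00007 plus _SUCCESS.
rdd.saveAsObjectFile("file:///tmp/mydir")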

I then load the RDD back using

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

I expect rdd2 to have 8 partitions, but the master UI shows that rdd2 has over 1000
partitions. This is very inefficient. How can I limit it to 8 partitions, matching what is
stored on the file system?
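A possible explanation: the second argument to sc.objectFile is minPartitions, a lower bound
passed through to the Hadoop input format, so the actual partition count follows the input
splits rather than the number of files saved. A minimal sketch of one possible workaround,
coalescing after the load (rdd8 is a placeholder name):

import org.apache.spark.mllib.regression.LabeledPoint

// minPartitions = 8 is only a hint; the Hadoop splits may produce far more.
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

// coalesce(8) merges the splits back down to 8 partitions without a
// shuffle; use repartition(8) instead if a full shuffle is acceptable.
val rdd8 = rdd2.coalesce(8)
println(rdd8.partitions.length) // expected: 8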

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541

