Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(range(0, 1000), 4).mapPartitionsWithIndex(lambda idx, it: [(idx, len(list(it)))]).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(Materializing each partition as a list just to take its length is not the most efficient way to count an iterator, but you get the idea...)
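If the partitions are large, you can avoid building the intermediate list and count lazily instead; something like this should do the same job (a quick sketch, not benchmarked):

>>> sc.parallelize(range(0, 1000), 4).mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))]).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

glom() is another option if you don't need the index spelled out per partition; it turns each partition into a list (so it materializes the data too), and the counts come back in partition order:

>>> sc.parallelize(range(0, 1000), 4).glom().map(len).collect()
[250, 250, 250, 250]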

Best,
-Sven

On Mon, Jan 12, 2015 at 6:54 PM, Kevin Burton <burton@spinn3r.com> wrote:
Is there a way to compute the total number of records in each RDD partition?

So say I had 4 partitions. I’d want to have:

partition 0: 100 records
partition 1: 104 records
partition 2: 90 records
partition 3: 140 records

Kevin

--

Founder/CEO Spinn3r.com
Location: San Francisco, CA

--
http://sites.google.com/site/krasser/?utm_source=sig