spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sven Krasser <kras...@gmail.com>
Subject Re: quickly counting the number of rows in a partition?
Date Tue, 13 Jan 2015 03:11:34 GMT
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(xrange(0,1000), 4).mapPartitionsWithIndex(lambda
idx,iter: ((idx, len(list(iter))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, but
you get the idea...)

Best,
-Sven

On Mon, Jan 12, 2015 at 6:54 PM, Kevin Burton <burton@spinn3r.com> wrote:

> Is there a way to compute the total number of records in each RDD
> partition?
>
> So say I had 4 partitions.. I’d want to have
>
> partition 0: 100 records
> partition 1: 104 records
> partition 2: 90 records
> partition 3: 140 records
>
> Kevin
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
> <http://spinn3r.com>
>
>


-- 
http://sites.google.com/site/krasser/?utm_source=sig

Mime
View raw message