spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tobias Pfeiffer <...@preferred.jp>
Subject Re: quickly counting the number of rows in a partition?
Date Wed, 14 Jan 2015 01:06:09 GMT
Hi,

On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com>
 wrote:

> Use the mapPartitions function. It returns an iterator to each partition.
> Then just get that length by converting to an array.
>

On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <burton@spinn3r.com> wrote:

> Doesn’t that just read in all the values?  The count isn’t pre-computed?
> It’s not the end of the world if it’s not but would be faster.
>

Well, "converting to an array" may not work due to memory constraints,
counting the items in the iterator may be better. However, there is no
"pre-computed" value. For counting, you need to compute all values in the
RDD, in general. If you think of

    items.map(x => /* throw exception */).count()

then even though the count you want to get does not necessarily require the
evaluation of the function in map() (i.e., the number is the same), you may
not want to get the count if that code actually fails.

Tobias

Mime
View raw message