spark-user mailing list archives

From Ilya Ganelin <ilgan...@gmail.com>
Subject Re: quickly counting the number of rows in a partition?
Date Wed, 14 Jan 2015 18:54:19 GMT
The number of records is not stored anywhere. You either need to save it at
creation time or step through the RDD.
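
As a minimal sketch of the "step through the RDD" option (not from the original message), the per-partition count can be taken by walking each partition's iterator instead of converting it to an array. The names `PartitionCounts`, `countPartition`, and `countAll` are hypothetical, and plain Scala collections stand in for an RDD so that the sketch runs without Spark:

```scala
object PartitionCounts {
  // Counts one partition by walking its iterator; nothing is copied into an array.
  def countPartition[T](index: Int, it: Iterator[T]): Iterator[(Int, Long)] = {
    var n = 0L
    while (it.hasNext) { it.next(); n += 1 }
    Iterator((index, n))
  }

  // Local stand-in for an RDD (an assumption): each inner Seq plays the role of one partition.
  def countAll[T](partitions: Seq[Seq[T]]): Seq[(Int, Long)] =
    partitions.zipWithIndex.flatMap { case (p, i) => countPartition(i, p.iterator) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b", "c"), Seq.empty[String], Seq("d"))
    countAll(partitions).foreach { case (i, n) => println(s"partition $i: $n records") }
    // With Spark, the same function would be applied as (not run here):
    //   rdd.mapPartitionsWithIndex((i, it) => PartitionCounts.countPartition(i, it)).collect()
  }
}
```

Every element is still visited once, so this only avoids the memory cost of materializing a partition; it does not make the count free.
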
On Wed, Jan 14, 2015 at 1:46 PM Michael Segel <msegel_hadoop@hotmail.com>
wrote:

> Sorry, but the accumulator is still going to require you to walk through
> the RDD to get an accurate count, right?
> It's not being persisted?
>
> On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com>
> wrote:
>
>  An alternative to a naive toArray is to declare an accumulator per
> partition and use that. This is specifically what accumulators were
> designed to do. See the programming guide.
>
> -----Original Message-----
> *From: *Tobias Pfeiffer [tgp@preferred.jp]
> *Sent: *Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
> *To: *Kevin Burton
> *Cc: *Ganelin, Ilya; user@spark.apache.org
> *Subject: *Re: quickly counting the number of rows in a partition?
>
> Hi,
>
> On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya <
> Ilya.Ganelin@capitalone.com> wrote:
>
>> Use the mapPartitions function. It passes your function an iterator over
>> each partition. Then just get the length by converting it to an array.
>>
>
>  On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <burton@spinn3r.com> wrote:
>
>> Doesn’t that just read in all the values?  The count isn’t pre-computed?
>> It’s not the end of the world if it’s not but would be faster.
>>
>
>  Well, "converting to an array" may not work due to memory constraints;
> counting the items in the iterator may be better. However, there is no
> "pre-computed" value. For counting, you need to compute all values in the
> RDD, in general. If you think of
>
>     items.map(x => /* throw exception */).count()
>
> then even though the count you want to get does not necessarily require
> the evaluation of the function in map() (i.e., the number is the same), you
> may not want to get the count if that code actually fails.
>
> Tobias
>
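
The point about `items.map(x => /* throw exception */).count()` can be illustrated with a plain Scala `Iterator`, whose `map` is lazy in a broadly similar way to an RDD transformation. This is only an analogue under that assumption, not Spark code, and `LazyCountDemo` and `failing` are hypothetical names:

```scala
import scala.util.Try

object LazyCountDemo {
  // Lazily mapped iterator whose function fails on every element
  // (a local stand-in for rdd.map with a throwing function).
  def failing(items: Seq[Int]): Iterator[Int] =
    items.iterator.map(x => throw new IllegalStateException(s"cannot process $x"))

  def main(args: Array[String]): Unit = {
    val items = Seq(1, 2, 3)
    val mapped = failing(items)     // no element has been processed yet
    val counted = Try(mapped.size)  // counting walks the iterator and runs the function
    println(s"count succeeded: ${counted.isSuccess}")
    println(s"size of the source collection: ${items.size}")
  }
}
```

The element count is known from the source collection, yet counting the mapped iterator still runs the function and surfaces its failure, which is the behavior described above.
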
