spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <msegel_had...@hotmail.com>
Subject Re: quickly counting the number of rows in a partition?
Date Wed, 14 Jan 2015 18:45:36 GMT
Sorry, but the accumulator is still going to require you to walk through the RDD to get an
accurate count, right? 
Its not being persisted? 

On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com> wrote:

> Alternative to doing a naive toArray is to declare an accumulator per partition and use
that. It's specifically what they were designed to do. See the programming guide.
> 
> 
> 
> Sent with Good (www.good.com)
> 
> 
> -----Original Message-----
> From: Tobias Pfeiffer [tgp@preferred.jp]
> Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
> To: Kevin Burton
> Cc: Ganelin, Ilya; user@spark.apache.org
> Subject: Re: quickly counting the number of rows in a partition?
> 
> Hi,
> 
> On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com> wrote:
> Use the mapPartitions function. It returns an iterator to each partition. Then just get
that length by converting to an array.
>  
> On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <burton@spinn3r.com> wrote:
> Doesn’t that just read in all the values?  The count isn’t pre-computed? It’s not
the end of the world if it’s not but would be faster.
> 
> Well, "converting to an array" may not work due to memory constraints, counting the items
in the iterator may be better. However, there is no "pre-computed" value. For counting, you
need to compute all values in the RDD, in general. If you think of
> 
>     items.map(x => /* throw exception */).count()
> 
> then even though the count you want to get does not necessarily require the evaluation
of the function in map() (i.e., the number is the same), you may not want to get the count
if that code actually fails.
> 
> Tobias
> 
> The information contained in this e-mail is confidential and/or proprietary to Capital
One and/or its affiliates. The information transmitted herewith is intended only for use by
the individual or entity to which it is addressed.  If the reader of this message is not the
intended recipient, you are hereby notified that any review, retransmission, dissemination,
distribution, copying or other use of, or taking of any action in reliance upon this information
is strictly prohibited. If you have received this communication in error, please contact the
sender and delete the material from your computer.


Mime
View raw message