spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: takeSample() computation
Date Thu, 05 Dec 2013 18:31:40 GMT
Hi Matt,

Try using take() instead. When the number of elements you ask for is small, it only computes
from the start of the RDD (the first partition) rather than the whole dataset.
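
As a rough illustration of why take() can be cheap, here is a toy Python model. It is not Spark code and not how Spark implements take(); the names make_partitions, expensive_transform and toy_take are made up for this sketch. Partitions are modeled as lazy generators, and toy_take stops pulling elements once it has enough, so later partitions are never computed.

```python
from itertools import islice

computed = []  # records every input element the "transformation" actually touched


def expensive_transform(x):
    """Stand-in for a costly per-element transformation (hypothetical)."""
    computed.append(x)
    return x * 10


def make_partitions(data, num_partitions):
    """Split data into lazily transformed partitions (one generator each)."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [
        (expensive_transform(x) for x in data[i:i + size])
        for i in range(0, len(data), size)
    ]


def toy_take(partitions, n):
    """Pull elements partition by partition, stopping once n are collected."""
    result = []
    for part in partitions:
        remaining = n - len(result)
        if remaining <= 0:
            break
        result.extend(islice(part, remaining))
    return result


partitions = make_partitions(list(range(12)), num_partitions=4)
first_two = toy_take(partitions, 2)
print(first_two)  # [0, 10]
print(computed)   # [0, 1] -- only two elements of the first partition were transformed
```

In this model a sampling action would have no such shortcut, since it cannot know which elements to keep without looking at every partition.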

Note that if you’re doing any shuffle operations, like groupBy or sort, then the stages
before that do have to be computed fully.
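
The shuffle caveat can be sketched the same way. Again this is a toy Python model under stated assumptions, not Spark itself: lazy_stage and pre_shuffle_map are hypothetical names, and Python's sorted() stands in for a sort that needs a shuffle. Without the sort, taking two elements touches two inputs; with the sort in between, every input has to be processed before the first result is available.

```python
from itertools import islice

touched = []  # every input element the pre-shuffle stage actually processed


def pre_shuffle_map(x):
    """Stand-in for the transformations that run before the shuffle."""
    touched.append(x)
    return -x


def lazy_stage(data):
    """A lazily evaluated map stage over the input."""
    return (pre_shuffle_map(x) for x in data)


data = [5, 3, 8, 1, 9, 2]

# No shuffle: taking 2 elements pulls only 2 inputs through the map.
no_shuffle = list(islice(lazy_stage(data), 2))
touched_without_sort = list(touched)

touched.clear()

# With a sort in between: sorted() must consume the whole stage first.
with_sort = list(islice(sorted(lazy_stage(data)), 2))
touched_with_sort = list(touched)

print(no_shuffle, touched_without_sort)  # [-5, -3] [5, 3]
print(with_sort, touched_with_sort)      # [-9, -8] [5, 3, 8, 1, 9, 2]
```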

Matei

On Dec 5, 2013, at 10:13 AM, Matt Cheah <mcheah@palantir.com> wrote:

> Hi everyone,
> 
> I have a question about RDD.takeSample(). This is an action, not a transformation –
> but is any optimization made to reduce the amount of computation that's done, for example
> only running the transformations over a smaller subset of the data since only a sample will
> be returned as a result?
> 
> The context is, I'm trying to measure the amount of time a set of transformations takes
> on our dataset without persisting to disk. So I want to stack the operations on the RDD and
> then invoke an action that doesn't save the result to disk but can still give me a good idea
> of how long transforming the whole dataset takes.
> 
> Thanks,
> 
> -Matt Cheah

