spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Re: takeSample() computation
Date Thu, 05 Dec 2013 20:54:53 GMT
Actually, we want the opposite – we want as much data to be computed as possible.

It's only for benchmarking purposes, of course.
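For anyone after the same thing, a minimal sketch of what that could look like, assuming an existing SparkContext sc and a placeholder transformation (neither is from the original job): run an action that touches every partition but keeps nothing.

// Sketch (not from the original thread): force full evaluation of a
// transformed RDD without writing output anywhere. `sc` is an existing
// SparkContext; the map step stands in for the real transformations.
val transformed = sc.textFile("hdfs:///data/input")   // hypothetical input path
  .map(line => line.split(",").length)                // placeholder transformation

val start = System.nanoTime()
transformed.foreach(_ => ())   // action: computes every partition, discards the results
// or: transformed.count()
val elapsedMs = (System.nanoTime() - start) / 1e6
println("Transformations took " + elapsedMs + " ms")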

-Matt Cheah

From: Matei Zaharia <matei.zaharia@gmail.com>
Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Date: Thursday, December 5, 2013 10:31 AM
To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Cc: Mingyu Kim <mkim@palantir.com>
Subject: Re: takeSample() computation

Hi Matt,

Try using take() instead, which will only begin computing from the start of the RDD (first
partition) if the number of elements you ask for is small.

Note that if you’re doing any shuffle operations, like groupBy or sort, then the stages
before that do have to be computed fully.
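For example, a rough sketch of the difference, with a placeholder pipeline rather than your actual job:

// Sketch: take(n) computes only as many partitions as needed for n elements.
val rdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)  // placeholder pipeline
rdd.take(10)   // typically evaluates just the first partition

// With a shuffle (e.g. groupBy) in the pipeline, the stages before the
// shuffle are computed over all partitions before take() can return.
rdd.groupBy(_.length).take(10)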

Matei

On Dec 5, 2013, at 10:13 AM, Matt Cheah <mcheah@palantir.com> wrote:

Hi everyone,

I have a question about RDD.takeSample(). This is an action, not a transformation – but
does Spark apply any optimization to reduce the amount of computation, for example running
the transformations over only a subset of the data, since only a sample is returned as the
result?

The context is, I'm trying to measure the amount of time a set of transformations takes on
our dataset without persisting to disk. So I want to stack the operations on the RDD and then
invoke an action that doesn't save the result to disk but can still give me a good idea of
how long transforming the whole dataset takes.
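Roughly, the pattern I mean (with placeholder transformations standing in for ours) is:

// Sketch of the scenario: stack transformations, then call an action that
// returns only a sample instead of writing the full result out.
val transformed = sc.textFile("hdfs:///data/input")   // placeholder input
  .map(_.split("\t"))
  .filter(_.length > 1)

// The question: does this compute the whole dataset, or only enough for the sample?
val sample = transformed.takeSample(false, 100, 42)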

Thanks,

-Matt Cheah

