spark-user mailing list archives

From Matei Zaharia <>
Subject Re: takeSample() computation
Date Thu, 05 Dec 2013 22:55:39 GMT
Ah, got it. Then takeSample is going to do what you want, because it has to compute the whole
RDD in order to return a uniform sample.
If you don’t want any result at all, you can also use RDD.foreach() with an empty function.
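
Editorial illustration of the point above: a minimal plain-Python analogy, not Spark code and not how Spark implements these actions. It counts how many elements a lazy pipeline computes under each kind of action. The helpers make_pipeline, take, take_sample, foreach and count_computed are hypothetical names, and reservoir sampling is used here only as one simple way to draw a uniform sample.

```python
import random
from itertools import islice

def make_pipeline(data, counter):
    """Lazily apply a stand-in transformation, counting each element computed."""
    def transform(x):
        counter["computed"] += 1
        return x * 2
    return (transform(x) for x in data)

def take(pipeline, n):
    # Analogous to take(): stops pulling elements once n results exist.
    return list(islice(pipeline, n))

def take_sample(pipeline, n, seed=0):
    # Uniform sample via reservoir sampling: every element must be computed,
    # because each one needs a chance of ending up in the result.
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(pipeline):
        if i < n:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = x
    return reservoir

def foreach(pipeline, fn):
    # Analogous to foreach(): visits every element and returns nothing.
    for x in pipeline:
        fn(x)

def count_computed(action, size=1000):
    """Run an action over a fresh pipeline and report how many elements it computed."""
    counter = {"computed": 0}
    action(make_pipeline(range(size), counter))
    return counter["computed"]

if __name__ == "__main__":
    print("take(5):       ", count_computed(lambda p: take(p, 5)))
    print("take_sample(5):", count_computed(lambda p: take_sample(p, 5)))
    print("foreach(no-op):", count_computed(lambda p: foreach(p, lambda _: None)))
```

In this sketch only the take-style action stops early. The sampling and no-op foreach actions both walk the full input, which is the behaviour wanted when benchmarking a full set of transformations.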


On Dec 5, 2013, at 12:54 PM, Matt Cheah <> wrote:

> Actually, we want the opposite – we want as much data to be computed as possible.
> It's only for benchmarking purposes, of course.
> -Matt Cheah
> From: Matei Zaharia <>
> Reply-To: "" <>
> Date: Thursday, December 5, 2013 10:31 AM
> To: "" <>
> Cc: Mingyu Kim <>
> Subject: Re: takeSample() computation
> Hi Matt,
> Try using take() instead, which will only begin computing from the start of the RDD (first
> partition) if the number of elements you ask for is small.
> Note that if you’re doing any shuffle operations, like groupBy or sort, then the stages
> before that do have to be computed fully.
> Matei
> On Dec 5, 2013, at 10:13 AM, Matt Cheah <> wrote:
>> Hi everyone,
>> I have a question about RDD.takeSample(). This is an action, not a transformation
>> – but is any optimization made to reduce the amount of computation that's done, for example
>> only running the transformations over a smaller subset of the data since only a sample will
>> be returned as a result?
>> The context is, I'm trying to measure the amount of time a set of transformations
>> takes on our dataset without persisting to disk. So I want to stack the operations on the
>> RDD and then invoke an action that doesn't save the result to disk but can still give me a
>> good idea of how long transforming the whole dataset takes.
>> Thanks,
>> -Matt Cheah
