spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: takeSample() computation
Date Fri, 06 Dec 2013 00:02:07 GMT
...and RDD.foreachPartition(_ => {}) is probably marginally faster.
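
A minimal sketch of the benchmarking approach discussed in this thread, assuming a recent Spark release where SparkConf is available and a local master. The dataset, the transformations, the object name and the timeMillis helper are hypothetical placeholders, not anything from the original posts. One caveat to check against your Spark version: narrow transformations such as map and filter are pipelined lazily through the partition iterator, so a per-partition function that ignores its iterator may not pull any elements through them. The sketch therefore drains the iterator explicitly in the foreachPartition variant.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForceEvaluationBenchmark {

  // Hypothetical helper: runs `body` once and returns the elapsed
  // wall-clock time in milliseconds.
  def timeMillis(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("force-evaluation-benchmark")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the real dataset and its transformations.
      val transformed = sc.parallelize(1 to 1000000, 8)
        .map(x => x * 2L)
        .filter(_ % 3 == 0)

      // Option 1: foreach with an empty function, called once per element.
      val foreachMs = timeMillis {
        transformed.foreach(_ => ())
      }

      // Option 2: one call per partition; the iterator is drained explicitly
      // so that every element flows through the transformations.
      val perPartitionMs = timeMillis {
        transformed.foreachPartition(iter => iter.foreach(_ => ()))
      }

      println(s"foreach: $foreachMs ms, foreachPartition: $perPartitionMs ms")
    } finally {
      sc.stop()
    }
  }
}
```

Because the RDD is not cached, each action recomputes the full lineage, so both timings cover the transformations themselves. The first action also absorbs one-time startup costs, so it is worth repeating the measurement a few times before comparing the two.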


On Thu, Dec 5, 2013 at 2:55 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> Ah, got it. Then takeSample is going to do what you want, because it has to
> compute the whole RDD to draw a uniform sample. If you don’t want any result
> at all, you can also use RDD.foreach() with an empty function.
>
> Matei
>
> On Dec 5, 2013, at 12:54 PM, Matt Cheah <mcheah@palantir.com> wrote:
>
>  Actually, we want the opposite – we want as much data to be computed as
> possible.
>
>  It's only for benchmarking purposes, of course.
>
>  -Matt Cheah
>
>   From: Matei Zaharia <matei.zaharia@gmail.com>
> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Date: Thursday, December 5, 2013 10:31 AM
> To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Cc: Mingyu Kim <mkim@palantir.com>
> Subject: Re: takeSample() computation
>
>   Hi Matt,
>
>  Try using take() instead, which will only begin computing from the start
> of the RDD (first partition) if the number of elements you ask for is small.
>
>  Note that if you’re doing any shuffle operations, like groupBy or sort,
> then the stages before that do have to be computed fully.
>
>  Matei
>
>  On Dec 5, 2013, at 10:13 AM, Matt Cheah <mcheah@palantir.com> wrote:
>
>  Hi everyone,
>
>  I have a question about RDD.takeSample(). This is an action, not a
> transformation – but is any optimization made to reduce the amount of
> computation that's done, for example only running the transformations over
> a smaller subset of the data since only a sample will be returned as a
> result?
>
>  The context is, I'm trying to measure the amount of time a set of
> transformations takes on our dataset without persisting to disk. So I want
> to stack the operations on the RDD and then invoke an action that doesn't
> save the result to disk but can still give me a good idea of how long
> transforming the whole dataset takes.
>
>  Thanks,
>
>  -Matt Cheah
>
>
>
>
