spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject takeSample() computation
Date Thu, 05 Dec 2013 18:13:40 GMT
Hi everyone,

I have a question about RDD.takeSample(). It is an action, not a transformation, but does
Spark optimize it to reduce the amount of computation that's done? For example, does it run
the transformations over only a smaller subset of the data, given that only a sample is
returned as the result?

For context, I'm trying to measure how long a set of transformations takes on our dataset
without persisting the result to disk. So I want to stack the operations on the RDD and then
invoke an action that doesn't save the result to disk but still gives me a good idea of
how long transforming the whole dataset takes.
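
To make the kind of measurement I mean concrete, here is a minimal sketch in plain Python.
It is only an illustration: time_action, lazy_pipeline and count are hypothetical names, and
the generator pipeline is a stand-in for an RDD lineage, not Spark itself. The idea is to
wrap a count-style action, which has to visit every record but keeps only a tally, in a
wall-clock timer.

```python
import time


def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def lazy_pipeline(records):
    # Stand-in for stacked transformations: generators do no work until consumed.
    doubled = (x * 2 for x in records)
    return (x for x in doubled if x % 3 == 0)


def count(iterable):
    # Stand-in for a count-style action: visits every element, keeps only a tally.
    return sum(1 for _ in iterable)


if __name__ == "__main__":
    n, elapsed = time_action(lambda: count(lazy_pipeline(range(1_000_000))))
    print(f"{n} records in {elapsed:.3f}s")
```

With a real RDD in PySpark, the same wrapper would presumably be called as
time_action(lambda: rdd.count()), since count() evaluates the whole RDD and returns only a
number to the driver.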

Thanks,

-Matt Cheah
