spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Re: RDD sample fraction precision
Date Mon, 21 Oct 2013 19:25:25 GMT
Ah, I misunderstood the functionality then – I was under the impression that exactly that
fraction would be returned.

Thanks,

-Matt Cheah

From: Aaron Davidson <ilikerps@gmail.com>
Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Date: Monday, October 21, 2013 12:18 PM
To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Subject: Re: RDD sample fraction precision

Perhaps I'm misunderstanding your question, but RDD.sample() just uses the fraction as the
probability of accepting a given tuple (rather than, say, taking every 7th tuple). So on average,
1/7 of the tuples will be returned. For small input sizes, though, this could return significantly
more or less than 1/7 of the tuples simply due to chance.
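
To put rough numbers on this for the n = 7, fraction = 1/7 case, here is a small sketch in plain Python (not Spark code). It assumes each item is accepted independently with probability equal to the fraction, as described above, so the sample size follows a binomial distribution. The helper name binomial_pmf is made up for illustration.

```python
from math import comb


def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability that exactly k of n independent trials succeed,
    where each trial succeeds with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed model: each of the n items is kept independently with probability p.
n, p = 7, 1 / 7

# Print the probability of every possible sample size from 0 to n.
for k in range(n + 1):
    print(f"P(sample size = {k}) = {binomial_pmf(n, k, p):.4f}")
```

Under this model, a sample of exactly 1 item comes back only about 40% of the time, an empty sample about 34% of the time, and 2 items about 20% of the time, so a result of 2 is not unusual for an input this small.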

On Mon, Oct 21, 2013 at 12:01 PM, Matt Cheah <mcheah@palantir.com> wrote:
Hi everyone,

I have a simple RDD of n items. The use case is to get a random sample of exactly k items
from this RDD. n and k may or may not be very large.

So right now, for n = 7 and k = 1, I have a unit test running locally that passes the fraction
1 / 7 to RDD.sample(). The double representation as printed by Eclipse is 0.14285714285714285.
The resulting RDD ends up with 2 items instead of 1.

Is it expected to get that much error in precision? I'd rather not use the takeSample() function
which would materialize the whole sample in the driver's memory.

Thanks,

-Matt Cheah

