spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: RDD sample fraction precision
Date Mon, 21 Oct 2013 19:18:58 GMT
Perhaps I'm misunderstanding your question, but RDD.sample() just uses the
fraction as the probability of accepting a given tuple (rather than, say,
taking every 7th tuple). So on average, 1/7 of the tuples will be returned.
For small input sizes, though, this could return significantly more or less
than 1/7 of the tuples simply due to chance.
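
To make the effect concrete, here is a rough sketch in plain Python. It is not Spark's actual implementation; it only assumes the behaviour described above, where each element is kept independently with probability equal to the fraction. The helper names bernoulli_sample and prob_sample_size are made up for this illustration. Under that assumption the sample size follows a binomial distribution, so for n = 7 and fraction = 1/7 you get exactly 1 item only about 40% of the time, 0 items about 34% of the time, and 2 items about 20% of the time.

```python
import random
from math import comb

def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`.

    Hypothetical helper: mimics the per-element accept/reject behaviour
    described in this thread, not Spark's real code.
    """
    return [x for x in items if rng.random() < fraction]

def prob_sample_size(n, fraction, k):
    """Binomial probability that exactly k of n items are kept."""
    return comb(n, k) * fraction**k * (1 - fraction) ** (n - k)

n, fraction = 7, 1 / 7
rng = random.Random(42)  # fixed seed so repeated runs give the same draws

# Repeat the experiment many times and record how many items come back.
sizes = [len(bernoulli_sample(range(n), fraction, rng)) for _ in range(10000)]
print("mean sample size:", sum(sizes) / len(sizes))

# Exact probabilities of getting 0, 1, 2 or 3 items back.
for k in range(4):
    print(k, round(prob_sample_size(n, fraction, k), 4))
```

The mean sample size comes out close to 1, but any single run can easily return 0 or 2 items, which matches what the unit test below observed.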

On Mon, Oct 21, 2013 at 12:01 PM, Matt Cheah <mcheah@palantir.com> wrote:

>  Hi everyone,
>
>  I have a simple RDD of n items. The use case is to get a random sample
> of exactly k items from this RDD. n and k may or may not be very large.
>
>  So right now for n = 7, k = 1, I have a unit test running locally, that
> passes the fraction 1 / 7 to RDD.sample(). The double representation as
> printed by Eclipse is 0.14285714285714285. The resulting RDD ends up
> getting 2 items back instead of 1.
>
>  Is it expected to get that much error in precision? I'd rather not use
> the takeSample() function which would materialize the whole sample in the
> driver's memory.
>
>  Thanks,
>
>  -Matt Cheah
>
