Perhaps I'm misunderstanding your question, but RDD.sample() just uses the
fraction as the probability of accepting each tuple independently (rather
than, say, taking every 7th tuple). So on average, 1/7 of the tuples will be
returned.
For small input sizes, though, this could return significantly more or less
than 1/7 of the tuples simply due to chance.
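
To put rough numbers on that for n = 7 and fraction = 1/7, here is a small
sketch in plain Python (no Spark involved). It assumes each tuple is kept
independently with probability equal to the fraction, as described above, so
the sample size follows a binomial distribution. The helper name
sample_size_pmf is made up for illustration and is not part of any Spark API.

```python
from math import comb

def sample_size_pmf(n, fraction):
    """Probability of each possible sample size (0..n) when every one of
    n items is kept independently with probability `fraction`.
    Hypothetical helper for illustration, not a Spark function."""
    return [comb(n, k) * fraction**k * (1 - fraction)**(n - k)
            for k in range(n + 1)]

# The case from the question: n = 7 items, fraction = 1/7.
pmf = sample_size_pmf(7, 1 / 7)
for k, p in enumerate(pmf):
    print(f"{k} items: {p:.3f}")
```

Under that assumption, the chance of getting exactly 1 item is only about
40%, the chance of getting none is about 34%, and the chance of getting 2 is
about 20%. So seeing 2 items back is not a floating-point precision problem.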

On Mon, Oct 21, 2013 at 12:01 PM, Matt Cheah <mcheah@palantir.com> wrote:
> Hi everyone,
>
> I have a simple RDD of n items. The use case is to get a random sample
> of exactly k items from this RDD. n and k may or may not be very large.
>
> So right now for n = 7, k = 1, I have a unit test running locally, that
> passes the fraction 1 / 7 to RDD.sample(). The double representation as
> printed by Eclipse is 0.14285714285714285. The resulting RDD ends up
> getting 2 items back instead of 1.
>
> Is it expected to get that much error in precision? I'd rather not use
> the takeSample() function which would materialize the whole sample in the
> driver's memory.
>
> Thanks,
>
> Matt Cheah
>
