Hi Sebastian,
I believe the reasoning was as follows. The actual number of times we expect an element to occur in sampling with replacement is given by the binomial distribution (http://en.wikipedia.org/wiki/Binomial_distribution), but for rare events this can be approximated with a Poisson distribution: http://en.wikipedia.org/wiki/Poisson_limit_theorem.
It's true though that this is just an approximation for large datasets and small sampling fractions. We might want to do something different for small ones, or to explain that this is an approximation in the docs of takeSample.
For the first part (can you take a big sample and then truncate it), I think the approach is fine -- we are just taking a uniform sample out of the larger sample we pulled out with replacement. For example, if you sampled 2n elements from a distribution with replacement, and then took half of those at random, it should be the same as taking n with replacement at the start. The "with replacement" part in particular means that the underlying chance of drawing each element doesn't change as you take more.
Matei
On Sep 27, 2013, at 7:38 AM, Sebastian Schelter wrote:
> Hi,
>
> I saw that Spark's RDD's allow taking a fixed size sample with
> replacement. If I read the code correctly, it takes a sample larger than
> asked for, randomly shuffles the sampled datapoints and truncates the
> sample to the number of elements requested.
>
> The sampling itself is done by approximating the distribution of the
> number of occurrences of each datapoint in the sample with a Poisson
> distribution. Can you point me to some resource describing the
> mathematical background of this approach?
>
> Best,
> Sebastian
>