mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Measuring randomness
Date Wed, 01 Jun 2011 14:33:32 GMT
On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen <srowen@gmail.com> wrote:

> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.
>

And how that number is specified.  And whether order is preserved.  And
whether you get samples along the way so that you can overlap computation
with I/O.
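
To make those differences concrete, here is a minimal Python sketch of two samplers that both include each element with probability n/total. The names `bernoulli_sample` and `reservoir_sample` are hypothetical and are not the Mahout classes under discussion; the sketch only assumes that the two samplers in this thread differ roughly along these lines.

```python
import random

def bernoulli_sample(items, p, rng):
    """Yield each item independently with probability p.

    Preserves input order and emits samples as the input is read,
    but the number of samples is random (mean p * len(items)).
    """
    for x in items:
        if rng.random() < p:
            yield x

def reservoir_sample(items, n, rng):
    """Return exactly n items (reservoir sampling, Algorithm R).

    The sample size is fixed, but no sample is final until the input
    is exhausted, and the result is generally not in input order.
    """
    reservoir = []
    for i, x in enumerate(items):
        if i < n:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = x
    return reservoir

rng = random.Random(42)  # arbitrary seed, for repeatability only
data = range(1000)
streamed = list(bernoulli_sample(data, 20 / 1000, rng))  # size varies, order kept
fixed = reservoir_sample(data, 20, rng)                  # size is exactly 20
```

The generator form is what allows downstream computation to overlap with I/O: each sample is available as soon as its input element has been read.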

> I am not sure what the distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.


It is asymptotically normal under pretty broad assumptions (see the link
below).  For a normal underlying distribution, it converges very quickly.
For a wacky underlying distribution like the Cauchy, less quickly.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598
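
For reference, the standard form of this result is: if the underlying density f is continuous and positive at the true median m, then sqrt(n) * (M_n - m) tends to a normal distribution with mean 0 and variance 1 / (4 f(m)^2). The following Python sketch compares the empirical spread of the sample median with that limit for a standard normal and a standard Cauchy. The function names, sample size, trial count, and seed are all arbitrary choices for illustration, and a single n does not by itself show the rate of convergence.

```python
import math
import random
import statistics

def median_sd(draw, n, trials, rng):
    """Empirical standard deviation of the median of n draws, over many trials."""
    medians = [
        statistics.median(draw(rng) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(medians)

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    # Standard Cauchy via the inverse CDF: tan(pi * (u - 1/2)).
    return math.tan(math.pi * (rng.random() - 0.5))

def asymptotic_sd(density_at_median, n):
    # Standard deviation of the limiting normal: 1 / (2 * f(m) * sqrt(n)).
    return 1.0 / (2.0 * density_at_median * math.sqrt(n))

rng = random.Random(1)   # arbitrary seed
n, trials = 101, 2000    # arbitrary sizes

normal_emp = median_sd(draw_normal, n, trials, rng)
normal_theory = asymptotic_sd(1.0 / math.sqrt(2.0 * math.pi), n)  # f(0) for N(0, 1)

cauchy_emp = median_sd(draw_cauchy, n, trials, rng)
cauchy_theory = asymptotic_sd(1.0 / math.pi, n)                   # f(0) for Cauchy

print(normal_emp, normal_theory)
print(cauchy_emp, cauchy_theory)
```

Repeating this for several values of n (say 5, 21, 101) is one way to see how quickly each case approaches its limit.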


> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?
>

But this doesn't really answer an important question, because the underlying
data was sampled from the same distribution, so a variety of defective
samplers would give similar results.


> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if N>2)
> as this one, when clearly it has a problem.
>

And I think that this sort of thing is the key question.

Make sure that you use sorted data as one test input.  Compute a full (exact)
median of the samples rather than using OnlineSummarizer, because
OnlineSummarizer doesn't like ordered data.
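
A test along those lines might look like the following Python sketch. It feeds sorted input to a sampler many times, takes the exact median of each sample, and also counts how often each element is picked. Both samplers here are hypothetical stand-ins (a plain reservoir sampler and a deliberately broken variant that never picks the first or last element), not Mahout code, and the trial count and seed are arbitrary.

```python
import random
import statistics
from collections import Counter

def reservoir_sample(items, n, rng):
    """Fixed-size uniform sample of n items (Algorithm R)."""
    reservoir = []
    for i, x in enumerate(items):
        if i < n:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = x
    return reservoir

def defective_sample(items, n, rng):
    """Deliberately broken: never picks the first or last element."""
    return reservoir_sample(items[1:-1], n, rng)

def check(sampler, data, n, trials, rng):
    """Return the mean of the exact sample medians and the never-picked elements."""
    counts = Counter()
    medians = []
    for _ in range(trials):
        sample = sampler(data, n, rng)
        counts.update(sample)
        medians.append(statistics.median(sample))  # exact median, sorts the sample
    never_picked = [x for x in data if counts[x] == 0]
    return statistics.mean(medians), never_picked

rng = random.Random(7)       # arbitrary seed
data = list(range(1000))     # sorted input, as suggested above

good_mean, good_missing = check(reservoir_sample, data, 20, 1000, rng)
bad_mean, bad_missing = check(defective_sample, data, 20, 1000, rng)

print(good_mean, good_missing)
print(bad_mean, bad_missing)
```

In this sketch the medians of both samplers center near 499.5, so the median check alone cannot separate them. The per-element counts can: only the broken sampler leaves 0 and 999 unpicked.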


> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?
>

Since the samplers are equal in quality by design, the only problem I can
imagine is code error.
