commons-dev mailing list archives

From Phil Steitz <>
Subject Re: [math] EmpiricalDistribution
Date Wed, 07 Sep 2011 16:34:13 GMT
On 9/6/11 8:58 AM, Mikkel Meyer Andersen wrote:
> 2011/9/6 Phil Steitz <>:
>> On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
>>> 2011/9/5 Phil Steitz <>:
>>>> I have a couple of proposals for this class:
>>>> 0) Merge the interface and impl.   This is consistent with what we
>>>> are doing in some other places where we have only one implementation.
>>> Fine with me.
>>>> 1) Extend this class to actually provide a distribution - i.e.
>>>> implement the Distribution interface.
>>> Won't we have problems, e.g. with implementing cumulativeProbability?
>> The idea I had was to interpolate within bins.  So to compute the
>> cdf at x you would find its bin, sum the mass (based on number of
>> original sample points contained, like the sampling does) of the
>> bins below its containing bin and then use the defined kernel within
>> bin to determine how much of its own bin's mass to include.
> Seems reasonable. But: We might want to include a user specified
> support - just simple (endpoints of an interval) - or else the highest
> and lowest value specifies the support which might not be a good idea.

By the latter, do you mean just interpolate linearly between lowest
and highest, or do you mean the lowest / highest actually observed
points in the bin?  The first is like using a uniform kernel in the
bins.  By "user-specified support" I guess you mean make the
interpolation strategy pluggable somehow, right?  What got me
thinking about making the sampling kernel configurable was that a
uniform kernel would probably be better / more defensible for
interpolating the cdf in some cases.  Then you have to ask whether
it is OK to use a different kernel for sampling vs cdf computation.
My instinct is to say no and keep it simple -
allow a uniform kernel to be chosen in place of the hard-coded
Gaussian there now and then use the configured kernel for both
sampling and cdf computation.  Even with mixed kernels, you will
probably in most cases end up with decent fidelity between sampling
results and the cdf; but I can imagine scenarios where Gaussian
kernels with coarse grids could lead to funny sampling distributions
that would not follow the linearly-interpolated cdf very well near
grid points.
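
To make the idea concrete, here is a rough sketch of the scheme described above: equal-width bins, per-bin mass taken from the sample counts, and one configured kernel shared by sampling and cdf computation. It is only an illustration under stated assumptions, not the existing EmpiricalDistribution code. The names BinKernel, UniformBinKernel and BinnedDistributionSketch are hypothetical, only the uniform kernel is shown, and the support is assumed to run from the lowest to the highest observed value.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical within-bin kernel; not part of the Commons Math API. */
interface BinKernel {
    /** Fraction of the bin's mass at or below x, for lower <= x <= upper. */
    double massBelow(double x, double lower, double upper);

    /** Draws one value from the bin [lower, upper]. */
    double sample(Random rng, double lower, double upper);
}

/** Uniform kernel: mass is spread evenly across the bin. */
final class UniformBinKernel implements BinKernel {
    public double massBelow(double x, double lower, double upper) {
        return (x - lower) / (upper - lower);
    }

    public double sample(Random rng, double lower, double upper) {
        return lower + rng.nextDouble() * (upper - lower);
    }
}

/** Hypothetical binned distribution using one kernel for both sampling and cdf. */
final class BinnedDistributionSketch {
    private final double min;
    private final double max;
    private final double binWidth;
    private final long[] counts;
    private final long total;
    private final BinKernel kernel;

    BinnedDistributionSketch(double[] data, int binCount, BinKernel kernel) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        min = Arrays.stream(data).min().getAsDouble();
        max = Arrays.stream(data).max().getAsDouble();
        if (min == max) {
            throw new IllegalArgumentException("sample has zero range");
        }
        binWidth = (max - min) / binCount;
        counts = new long[binCount];
        for (double v : data) {
            counts[binIndex(v)]++;
        }
        total = data.length;
        this.kernel = kernel;
    }

    /** Index of the bin containing x, clamped so that max falls in the last bin. */
    private int binIndex(double x) {
        int i = (int) ((x - min) / binWidth);
        return Math.min(Math.max(i, 0), counts.length - 1);
    }

    /** Mass of all bins below x's bin, plus the kernel's share of x's own bin. */
    double cumulativeProbability(double x) {
        if (x <= min) {
            return 0.0;
        }
        if (x >= max) {
            return 1.0;
        }
        int bin = binIndex(x);
        long below = 0;
        for (int i = 0; i < bin; i++) {
            below += counts[i];
        }
        double lower = min + bin * binWidth;
        double within = kernel.massBelow(x, lower, lower + binWidth);
        return (below + within * counts[bin]) / total;
    }

    /** Picks a bin with probability proportional to its count, then samples within it. */
    double sample(Random rng) {
        long target = (long) (rng.nextDouble() * total);
        int bin = 0;
        long seen = counts[0];
        while (seen <= target) {
            bin++;
            seen += counts[bin];
        }
        double lower = min + bin * binWidth;
        return kernel.sample(rng, lower, lower + binWidth);
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 10};
        BinnedDistributionSketch d =
            new BinnedDistributionSketch(data, 5, new UniformBinKernel());
        System.out.println(d.cumulativeProbability(5.0)); // prints 0.5
    }
}
```

A Gaussian kernel would implement the same two methods from per-bin statistics, and because both sample() and cumulativeProbability() go through the one kernel instance, the two cannot drift apart.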

>>>> 2) make the kernel used within bins configurable.  Currently, values
>>>> are generated (and the cdf would be computed) assuming a Gaussian
>>>> distribution within bins.  I think at least a uniform option should
>>>> be provided.
>>> +1, maybe it can be generalised to providing user-defined kernels.
>> Good idea.  Need to think about how to enable that.
>> Thanks!
>> Phil
>>>> Thanks in advance for any feedback on this or further suggestions
>>>> for improvement.
>>>> Phil
> Cheers, Mikkel.