commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otmar Ertl <otmar.e...@gmail.com>
Subject Re: [math] random boolean arrays
Date Mon, 13 Jul 2015 20:19:08 GMT
On Mon, Jul 13, 2015 at 9:59 PM, Phil Steitz <phil.steitz@gmail.com> wrote:
> On 7/13/15 12:55 PM, Phil Steitz wrote:
>> On 7/13/15 12:16 PM, Otmar Ertl wrote:
>>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <gilles@harfang.homelinux.org>
>>> wrote:
>>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote:
>>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote:
>>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <gilles@harfang.homelinux.org>
>>>>>> wrote:
>>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote:
>>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote:
>>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>>>>>>>>>> The code implemented in MATH-1242 to
improve performance of KS
>>>>>>>>>>>>> monteCarloP in-lines efficient generation
of random boolean
>>>>>>>>>>>>> arrays.
>>>>>>>>>>>>> Unfortunately, I think the implementation
is not quite random
>>>>>>>>>>>>> (see
>>>>>>>>>>>>> comments on the ticket). To verify it,
we need to be able to test
>>>>>>>>>>>>> the random boolean array generation directly.
To do that, we have
>>>>>>>>>>>>> to either expose the method (at least
as protected) in the KS
>>>>>>>>>>>>> class
>>>>>>>>>>>>> or add it somewhere else. I propose the
latter but am not sure
>>>>>>>>>>>>> where to put it. For speed, we need to
avoid array copies, so the
>>>>>>>>>>>>> API will have to be something like randomize(boolean[],
nTrue).
>>>>>>>>>>>>> It
>>>>>>>>>>>>> could go in the swelling MathArrays class,
or
>>> RandomDataGenerator.
>>>>>>>>>>>>> The latter probably makes more sense,
but the API does not fit
>>> too
>>>>>>>>>>>>> well. Any ideas?
>>>>>>>>>>>> If it is just for testing purposes, you can
also access the method
>>>>>>>>>>>> in
>>>>>>>>>>>> question via reflection, see an example here:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes
>>>>>>>>>>> Do you think it *should* be a private method
of the K-S class?
>>>>>>>>>> Right now, I do not see much uses outside the class,
but if we
>>> decide
>>>>>>>>>> to
>>>>>>>>>> make it public then I would prefer a special util
class in the
>>> random
>>>>>>>>>> package to avoid cluttering the MathArrays class.
>>>>>>>>> OK, for now we can make it private and use the reflection
method
>>>>>>>>> above to test it.
>>>>>>>> ok, but I guess it is also fine to make it package private
as sebb
>>>>>>>> suggested. We did something similar recently for some of
the improved
>>>>>>>> sampling methods provided by Otmar.
>>>>>>>>
>>>>>>>>>> Regarding the RandomDataGenerator: I think this class
should be
>>>>>>>>>> deprecated and replaced by a Sampler interface as
proposed by
>>> Gilles.
>>>>>>>>> Please consider keeping this class. Consider this a user
request.
>>>>>>>>> I have quite a few applications that use this class for
two reasons:
>>>>>>>> ok, the reason why I thought the class should be deprecated
is because
>>>>>>>> it was not kept up-to-date with all the new discrete and
continuous
>>>>>>>> distributions that we added in the last 2-3 years. If you
think it is
>>>>>>>> useful, then we can keep it of course.
>>>>>>>>
>>>>>>>>> 1. One object instance tied to one PRNG that generates
data from
>>>>>>>>> multiple different distributions. This is convenient.
Sure, I
>>>>>>>>> could refactor all of these apps to instantiate new objects
for each
>>>>>>>>> type of generated data and hopefully still be able to
peg them to
>>>>>>>>> one PRNG; but that is needless work that also complicates
the code.
>>>>>>>>>
>>>>>>>>> 2. There are quite a few methods in this class that have
nothing to
>>>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx,
>>>>>>>>> etc) but which conveniently share the RandomGenerator.
I guess the
>>>>>>>>> utility methods get moved out somewhere else. Again,
I end up
>>>>>>>>> having to refactor all of my code that uses them and
when I want
>>>>>>>>> simulations to be based on a single PRNG, I have to find
a way to
>>>>>>>>> pass the RandomGenerator around to them.
>>>>>>>>>
>>>>>>>>> I don't yet see the need to refactor the sampling support
in the
>>>>>>>>> distributions package; but as my own apps are not impacted
by this,
>>>>>>>>> if everyone else sees the user impact of the refactoring
as
>>>>>>>>> outweighed by the benefit, I won't stand in the way.
Please lets
>>>>>>>>> just keep the RandomDataGenerator convenience class in
the random
>>>>>>>>> package in any case. I will take care of whatever adjustments
are
>>>>>>>>> needed to adapt to whatever we settle on for sampling
in the
>>>>>>>>> distributions package.
>>>>>>>> Well, it is not really necessary to do everything together
and
>>> refactor
>>>>>>>> the distributions.
>>>>>>>>
>>>>>>>> Probably it is better to start the other way round, and describe
what
>>> I
>>>>>>>> want to add, and see how other things fit in:
>>>>>>>>
>>>>>>>> * I want a generic Sampler interface, i.e. something like
this:
>>>>>>>> ** nextSample()
>>>>>>>> ** nextSamples(int size)
>>>>>>>> ** nextSamples(double[] samples)
>>>>>>> +1
>>>>>>>
>>> Do we really need all those 3 methods? If the functionality provided by the
>>> two latter methods is essential for convenience reasons, we could also
>>> offer utility functions that are able to fill an array with random numbers
>>> from a given sampler, e.g. MathArrays.fill(array, sampler) or
>>> array=MathArrays.generate(sampler, size).
>> In some cases, there may be distribution-specific algorithms for
>> generating sequences of values that are more efficient than just
>> calling sample() repeatedly.  So while providing a default impl in
>> an abstract sampler that just calls sample() repeatedly would make
>> sense; it might also make sense to allow the distribution to
>> override with something more efficient.   An example (not exposed
>> now) is sampling from a Gaussian distribution.  You can see that
>> under the covers, the BitStreamGenerator generates these in pairs.

The simultaneous generation of multiple random values, as done by some
algorithms (I only know the Box-Muller method), does not justify the
extra interface methods. With negligible overhead the extra values can
be cached and reused for subsequent calls. I prefer to keep interfaces
as simple as possible.

Otmar

>>>>>>>> there could be a DiscreteSampler and ContinuousSampler interface
to
>>>>>>>> handle the cases for int / double values.
>>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler,
to
>>>>>>> accomodate
>>>>>>> future needs (LongSampler, BooleanSampler (?)).
>>>>>> I would prefer consistent names for distributions and corresponding
>>>>>> samplers. The support of distributions must be of the same data type
>>>>>> as the return values of the corresponding sampler. Therefore, I would
>>>>>> call the samplers for RealDistribution and IntegerDistribution
>>>>>> RealSampler and IntegerSampler, respectively.
>>>> Ideally yes. But there is always the lingering question of what "Real"
>>>> means: the mathematical abstraction or the numerical representation?
>>>> The same "real" distributions could be implemented with "float"s.
>>>> If we ever need implementing a discrete distribution that is able to
>>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly
>>>> refer to the numerical type, not the mathematical abstraction of an
>>>> "integer" number...
>>>>
>>>> Gilles
>>>>
>>>>
>>> The underlying data type of RealDistribution implementations is implicitly
>>> defined to be "double". It would not make sense that methods such as
>>> getSupportLowerBound() return a different type than the associated sample
>>> method. Therefore, in the context of RealDistribution "real" means "double"
>>> at the moment. However, I am open for renaming RealDistribution as
>>> DoubleDistribution with a sampler named DoubleSampler. If a float-based
>>> implementation is really ever needed, interfaces named FloatDistribution
>>> and FloatSampler could be introduced.
>> Let's not change the names just to change them.  There is discussion
>> in the archives that led us to call what is traditionally referred
>> to as continuous, "Real" - mostly because we support some not
>> absolutely continuous distribution functions among the non-discrete
>> distributions.  We call discrete "Integer" because we force all
>> discrete domains to map into the integers.  In the 10+ years we have
>> offered both, we have never had a request to either support floats
>> as representations of reals or any discrete distribution that
>> requires the full Long value set.
> More precisely the full Long value set as actual parameters to
> distribution functions.  We already support distributions with
> infinite value sets.
>
> Phil
>>
>> Phil
>>>>> +1
>>>>>
>>>>> Phil
>>>>>>>> The distributions could either be changed to return such
a sampler as
>>>>>>>> Gilles proposed (with the advantage that no random instance
is tied to
>>>>>>>> the distribution itself), or implement the interface directly
(with
>>> the
>>>>>>>> advantage that we would not need to refactor too much).
>>>>>>> Unless I'm missing something, the refactoring would be fairly
the same:
>>>>>>> The latter case needs implementing 3 methods (2 new ones, one
with a
>>>>>>> name
>>>>>>> change).
>>>>>>> The former needs implementing the factory method proposed in
MATH-1158,
>>>>>>> plus the same methods as above (wrapped in the object returned
by the
>>>>>>> factory method).
>>>>>>>
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>> One can then create a sampler for any distribution
or from other
>>>>>>>>>> sources, e.g. when needing a fast and efficient sampler
without
>>>>>>>>>> replacement (see MATH-1239).
>>>>>>>>> +1 for sequential sampling. I don't follow exactly why
that
>>>>>>>>> requires refactoring the distributions; but if it helps
in a way I
>>>>>>>>> don't yet understand, that will help convince me that
refactoring
>>>>>>>>> sampling in the distributions package is worth the user
pain.
>>>>>>>> as I said above, I wanted to combine two things in one step,
maybe it
>>>>>>>> is
>>>>>>>> better to go step by step.
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>>
>>>>>> Otmar
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>> Otmar
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message