commons-dev mailing list archives

From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [rng] Copying samplers
Date Sat, 04 May 2019 22:52:03 GMT


> On 4 May 2019, at 22:34, Gilles Sadowski <gilleseran@gmail.com> wrote:
> 
> Hi.
> 
> On Sat, 4 May 2019 at 21:31, Alex Herbert <alex.d.herbert@gmail.com> wrote:
>> 
>> 
>> 
>>> On 4 May 2019, at 14:46, Gilles Sadowski <gilleseran@gmail.com> wrote:
>>> 
>>> Hello.
>>> 
>>> On Fri, 3 May 2019 at 16:57, Alex Herbert <alex.d.herbert@gmail.com> wrote:
>>>> 
>>>> Most of the samplers in the library have very small states that are easy
>>>> to compute. Some have computations that are more expensive, such as the
>>>> LargeMeanPoissonSampler or the DiscreteProbabilityCollectionSampler.
>>>> 
>>>> However once the state is computed the only part of the state that
>>>> changes is the RNG. I would like to suggest a way to copy samplers as
>>>> something like:
>>>> 
>>>> DiscreteSampler newInstance(UniformRandomProvider)
>>>> 
>>>> The new instance would share all the private state of the first sampler
>>>> except the RNG. This can be used for multi-threaded applications which
>>>> require a new sampler per thread but sample from the same distribution.
>>>> 
>>>> A particular case in point is the as yet not integrated
>>>> MarsagliaTsangWangSmallMeanPoissonSampler (see RNG-91 [1]) which has a
>>>> "large" state [2] that takes a "long" time [3] to compute but is
>>>> effectively immutable. This could be shared across instances saving
>>>> memory for parallel application.
>>>> 
>>>> A copy instance would be almost zero set-up time and provide opportunity
>>>> for caching of commonly used samplers.
>>> 
>>> The goal is sharing (immutable) state so it seems that the semantics is
>>> not "copy".
>>> 
>>> Isn't it a "factory" that we are after?  E.g. something like:
>>> public final class CachedSamplingFactory {
>>>   private static PoissonSamplerCache poisson = new PoissonSamplerCache();
>>> 
>>>   public PoissonSampler createPoissonSampler(UniformRandomProvider rng,
>>>                                              double mean) {
>>>       if (!poisson.isCached(mean)) {
>>>           // Initialize (requires synchronization) ...
>>>           poisson.createCache(mean);
>>>       }
>>>       // Construct using pre-built state.
>>>       return new PoissonSampler(poisson.getCache(mean), rng);
>>>   }
>>> }
>>> [It may be overkill, more work, and less performant…]
>> 
>> But you need a factory for every class you want to share state for, and the
>> factory actually has to look in a cache. If you operate on an instance then
>> you get what you want: another working version of the same sampler. It would
>> also be thread safe without synchronisation as long as the state is
>> immutable. The only mutable state is the passed-in RNG.
> 
> Agreed.  It was what I meant by the last sentence.
> 
>>> 
>>> IIUC, you suggest to add "newInstance" in the "DiscreteSampler" interface (?).
>> 
>> I did think of extending DiscreteSampler with this functionality, not adding
>> to the interface, as it is currently ‘functional’: it has only one method. I
>> think that should not change. Having thought about it a bit more, I like the
>> idea of a new functional interface. Perhaps:
>> 
>> interface DiscreteSamplerProvider {
>>    DiscreteSampler create(UniformRandomProvider rng);
>> }
>> 
>> Substitute ‘Provider’ for:
>> 
>> - Generator
>> - Supplier (possible clash or alignment with Java 8 depending on the way it is done)
>> - Factory (though the method is not static so I do not like this)
>> - etc
>> 
>> So this then becomes a functional interface that can be used by anything.
>> However, instances of a sampler would be expected to return a sampler
>> matching their own functionality.
>> 
>> Note there are some samplers not implementing an interface that could also
>> benefit from this, namely CollectionSampler and
>> DiscreteProbabilityCollectionSampler. So does this need a generic interface:
>> 
>> interface Sampler<T> {
>>    T sample();
>> }
>> 
>> To be complemented with:
>> 
>> interface SamplerProvider<T> {
>>    Sampler<T> create(UniformRandomProvider rng);
>> }
>> 
>> So the library would require:
>> 
>> SamplerProvider<T>
>> DiscreteSamplerProvider
>> ContinuousSamplerProvider
>> 
>> Any sampler can choose to implement being a Provider. There are some cases
>> where it is moot. For example, a ZigguratNormalizedGaussianSampler just
>> stores the rng in the constructor. However, it could still be a Provider;
>> the method would just call the constructor. It would allow writing a generic
>> multi-threaded application that just uses e.g. a DiscreteSamplerProvider to
>> create samplers for each thread. You can then drop in the actual
>> implementation you require. For example, you could swap the type of
>> PoissonSampler in your simulation by swapping the provider for the Poisson
>> distribution.
>> 
>> How does that sound?
> 
> Fine to have
>  DiscreteSamplerProvider
>  ContinuousSamplerProvider
> [Perhaps the "Supplier" suffix would be better to avoid confusion with
> "UniformRandomProvider".]
> 
> At first sight, I don't think that the generic interface would have
> any actual use since, ultimately, the return value of "sample()"
> will be either "int" or "double" (no polymorphism).
> 

The generic interface is for the samplers that are typed for collections and currently return
a sample T, or those that return arrays. It would not be for Integer or Double from the probability
distribution samplers. Here is what could use it:

CombinationSampler implements Sampler<int[]>
PermutationSampler implements Sampler<int[]>
CollectionSampler implements Sampler<T>
DiscreteProbabilityCollectionSampler implements Sampler<T>

All are in the package org.apache.commons.rng.sampling.

Each could also implement SamplerSupplier<T>.
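
A minimal sketch of how the two interfaces might fit together, assuming the names
Sampler<T> and SamplerSupplier<T> floated in this thread (neither exists in the library
yet). Rng is a hypothetical stand-in for UniformRandomProvider so the example compiles on
its own, and SimpleCollectionSampler is an illustrative class, not the library's
CollectionSampler:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for org.apache.commons.rng.UniformRandomProvider.
interface Rng {
    /** Returns a value in [0, n). */
    int nextInt(int n);
}

// Proposed generic sampler interface (name taken from this thread).
interface Sampler<T> {
    T sample();
}

// Proposed companion interface: creates a sampler bound to the given generator.
interface SamplerSupplier<T> {
    Sampler<T> create(Rng rng);
}

// Illustrative sampler that picks items uniformly from a collection.
final class SimpleCollectionSampler<T> implements Sampler<T>, SamplerSupplier<T> {
    private final List<T> items; // never modified after construction
    private final Rng rng;       // the only per-instance state

    SimpleCollectionSampler(Rng rng, Collection<T> collection) {
        // Defensive copy, made once for the first instance.
        this(new ArrayList<>(collection), rng);
    }

    // Used by create(): reuses the already-built list without copying it.
    private SimpleCollectionSampler(List<T> sharedItems, Rng rng) {
        if (sharedItems.isEmpty()) {
            throw new IllegalArgumentException("Empty collection");
        }
        this.items = sharedItems;
        this.rng = rng;
    }

    @Override
    public T sample() {
        return items.get(rng.nextInt(items.size()));
    }

    @Override
    public Sampler<T> create(Rng newRng) {
        return new SimpleCollectionSampler<>(items, newRng);
    }
}
```

The private constructor is the design choice that matters here: the public one pays for
the copy, while create() only binds a different generator to the same list.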

The set-up cost for the CombinationSampler/PermutationSampler would not be much different
from the constructor and no state can be shared. No real benefit here other than convenience.
But the two CollectionSamplers could share the final collection that is created and stored
from the constructor input data. For the case of a large discrete probability collection sampler
this could be a noticeable memory saving as it also stores the cumulative distribution
table. This would also save on the construction cost by not having to recompute it.
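
To make the saving concrete, here is a rough sketch of a weighted sampler whose copies
share both the item list and the cumulative table. WeightedSampler and UniformSource are
hypothetical names for illustration only; UniformSource stands in for
UniformRandomProvider, and the sampling logic is a plain binary search, not the code of
DiscreteProbabilityCollectionSampler:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for org.apache.commons.rng.UniformRandomProvider.
interface UniformSource {
    /** Returns a value in [0, 1). */
    double nextDouble();
}

// Illustrative weighted sampler; not the library implementation.
final class WeightedSampler<T> {
    private final List<T> items;       // shared between copies, never modified
    private final double[] cumulative; // shared between copies, never modified
    private final UniformSource rng;   // the only per-instance state

    WeightedSampler(UniformSource rng, List<T> items, double[] weights) {
        if (items.isEmpty() || items.size() != weights.length) {
            throw new IllegalArgumentException("Need one weight per item");
        }
        double sum = 0;
        for (double w : weights) {
            if (!(w > 0)) { // also rejects NaN
                throw new IllegalArgumentException("Weights must be positive");
            }
            sum += w;
        }
        // The expensive part: built once, then reused by every copy.
        this.items = new ArrayList<>(items);
        this.cumulative = new double[weights.length];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i] / sum;
            cumulative[i] = running;
        }
        cumulative[weights.length - 1] = 1.0; // guard against rounding
        this.rng = rng;
    }

    // Copy constructor: shares the tables, swaps the generator.
    private WeightedSampler(WeightedSampler<T> source, UniformSource rng) {
        this.items = source.items;
        this.cumulative = source.cumulative;
        this.rng = rng;
    }

    WeightedSampler<T> create(UniformSource newRng) {
        return new WeightedSampler<>(this, newRng);
    }

    T sample() {
        int index = Arrays.binarySearch(cumulative, rng.nextDouble());
        // A negative result encodes (-(insertion point) - 1).
        if (index < 0) {
            index = -index - 1;
        }
        return items.get(index);
    }
}
```

Each worker thread would call create() with its own generator; the item list and the
cumulative table exist once no matter how many threads sample from them.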

Alex


> Gilles
> 
>> 
>> Alex
>> 
>> 
>> 
>>> I'm a bit wary that this would compound two different functionalities:
>>> * data generator (method "sample"),
>>> * generator generator (method "newInstance").
>>> [But I currently don't have an example where this would be a problem.]
>>> 
>>> Regards,
>>> Gilles
>>> 
>>>> Alex
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/RNG-91
>>>> 
>>>> [2] kB, or possibly MB, of tabulated data
>>>> 
>>>> [3] Set-up cost for a Poisson sampler is on the order of 30 to 165 times
>>>> as long as a SmallMeanPoissonSampler for means of 2 and 32. Note,
>>>> however, that construction still takes only 1.1 and 4.5 microseconds for
>>>> the "long" time.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 



