commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [rng] RNG-101 new MarsagliaTsangWang discrete probability sampler
Date Mon, 13 May 2019 15:01:29 GMT


> On 11 May 2019, at 23:25, Alex Herbert <alex.d.herbert@gmail.com> wrote:
> 
> 
> 
>> On 11 May 2019, at 22:58, Gilles Sadowski <gilleseran@gmail.com> wrote:
>> 
>> Le sam. 11 mai 2019 à 23:32, Alex Herbert <alex.d.herbert@gmail.com> a écrit
:
>>> 
>>> 
>>> 
>>>> On 10 May 2019, at 15:07, Gilles Sadowski <gilleseran@gmail.com> wrote:
>>>> 
>>>> Hi.
>>>> 
>>>> Le ven. 10 mai 2019 à 15:53, Alex Herbert <alex.d.herbert@gmail.com <mailto:alex.d.herbert@gmail.com>>
a écrit :
>>>>> 
>>>>> 
>>>>> On 10/05/2019 14:27, Gilles Sadowski wrote:
>>>>>> Hi Alex.
>>>>>> 
>>>>>> Le ven. 10 mai 2019 à 13:57, Alex Herbert <alex.d.herbert@gmail.com>
a écrit :
>>>>>>> Can I get a review of the PR for RNG-101 please.
>>>>>> Thanks for this work!
>>>>>> 
>>>>>> I didn't go into the details; however, I see many fields and methods
like
>>>>>> table1 ... table5
>>>>>> fillTable1 ... fillTable5
>>>>>> getTable1 ... getTable5
>>>>>> Wouldn't it be possible to use a 2D table:
>>>>>> table[5][];
>>>>>> so that e.g. only one "fillTable(int tableIndex, /* other args */)"
method
>>>>>> is necessary (where "tableIndex" runs from 0 to 4)?
>>>>> 
>>>>> Yes. The design is based around using 5 tables as per the example code.
>>>>> 
>>>>> The sample() method knows which table it needs so it can directly jump
>>>>> to the table in question. I'd have to look at the difference in speed
>>>>> when using a 2D table as you are adding another array access but
>>>>> reducing the number of possible method calls (although you still need
a
>>>>> method call). Maybe this will be optimised out by the JVM.
>>>>> 
>>>>> If the speed is not a factor then I'll rewrite it. Otherwise it's
>>>>> probably better done for speed as this is the entire point of the
>>>>> sampler given it disregards any probability under 2^-31 (i.e. it's not
a
>>>>> perfectly fair sampler).
>>>>> 
>>>>> Note that 5 tables are needed for 5 hex digits (base 2^6). The paper
>>>>> states using 3 tables of base 2^10 then you get a speed increase
>>>>> (roughly 1.16x) at the cost of storage (roughly 9x). Changing to 2
>>>>> tables of base 2^15 does not make it much faster again.
>>>>> 
>>>>> I'll have a rethink to see if I can make the design work for different
>>>>> base sizes.
>>>> 
>>>> That could be an extension made easier with the 2D table, but
>>>> I quite agree that given the relatively minor speed improvement
>>>> to be expected, it is not the main reason; the rationale was just to
>>>> make the code a more compact and a little easier to grasp (IMHO).
>>>> 
>>>> Gilles
>>> 
>>> I’ve done a more extensive look at the implications of changing the implementation
of the algorithm. This tested using: 1D or 2D tables; interfaced storage to dynamic table
types; base 6 or base 10 for the algorithm; and allowing the base to be chosen. Results are
in the Jira ticket. Basically 2D arrays are slower and supporting choices for the backing
storage or base of the algorithm is slower.
>>> 
>>> To support the Poisson and Binomial samplers only requires 16-bit storage. So
a dedicated sampler using base 6 and short for the tables will be the best compromise between
storage space and speed. The base 10 sampler is faster but takes about 9-10x more space in
memory.
>>> 
>>> Note I originally wrote the sampler to use only 16-bit storage. I then modified
it to use dynamic storage without measuring performance. And so I made it slightly slower.
>>> 
>>> The question is does the library even need to have a 32-bit storage implementation?
This would be used for a probability distribution with more than 2^16 different possible samples.
I think this would be an edge case. Here the memory requirements will be in the tens of MB
at a minimum but may balloon to become much larger. In this case a different algorithm such
as the Alias method or a guide table is more memory stable as it only requires 12 bytes of
storage per index, irrespective of the shape of the probability distribution.
>>> 
>>> If different implementations (of this algorithm) are added to the library then
the effect of using a sampler that dynamically chooses the storage space and/or base for the
algorithm is noticeable in the performance. In this case these would be better served using
a factory:
>>> 
>>> class DiscreteProbabilitySamplerFactory {
>>>   DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider, double[])
>>> }
>>> 
>>> But if specifically targeting this algorithm it could be:
>>> 
>>> class MarsagliaTsangWangDiscreteProbabilitySamplerFactory {
>>>   DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider, double[],
boolean useBase10)
>>> }
>>> 
>>> Or something similar. The user can then choose to use a base 10 algorithm if
memory is not a concern.
>>> 
>>> I am wary of making this too complicated for just this sampler. So I would vote
for ignoring the base 10 version and sticking to the interfaced storage implementation in
the current PR or reverting back to the 16-bit storage and not supporting very large distributions.
In the later case this is at least partially supported by the fact that the method only supports
probabilities down to 1^-31. Anything else is set to zero after scaling by 2^30 and rounding.
So large probability distributions are more likely to have values that are misrepresented
due to the conversion of probabilities to fractions with a base of 2^30.
>>> 
>>> Thoughts on this?
>> 
>> I agree to not make it more complex than necessary for the expected
>> use-case.
>> Anyway, these "implementation details" are internal/private and can be
>> changed without notice if the need arises.
>> 
> 
> OK. I’ll revert to the 16-bit implementation as that is all that is required for the
Poisson and Binomial distributions.
> 
> I think that the performance effect of calling a delegate that was observed in the tests
on all the plain discrete probability samplers will also be seen for the Poisson and Binomial
samplers. The actual code for these named distribution samplers is all in the constructor
to create the normalised probability table. This is then passed to the MarsagliaTsangWangDiscreteProbabilitySampler
which is used as a delegate. So these classes could be replaced with factory methods in the
MarsagliaTsangWangDiscreteProbabilitySampler:
> 
> public static MarsagliaTsangWangDiscreteProbabilitySampler forPoissonDistribution(UniformRandomProvider
rng, double mean);
> public static MarsagliaTsangWangDiscreteProbabilitySampler forBinomialDistribution(UniformRandomProvider
rng, int trials, double p);
> public static MarsagliaTsangWangDiscreteProbabilitySampler forProbabilityDistrbution(UniformRandomProvider
rng, double[] probabilities);
> 
> There would be no public constructor for the sampler, only factory methods.
> 
> All the details of backing storage are then hidden with private classes.
> 
> I’ll create a version like this and try it through the speed test. In theory it should
be able to pick the best implementation for all cases. I’ll stick to using base 6 for now
but base 10 could be added to the API in the future if required.

I have posted the results of the new version to the Jira ticket.

Basically the use of a wrapper around the MarsagliaTsangWangDiscreteSampler for the Poisson
or Binomial distribution should be avoided. I’ve written the code to have implementations
that use 8, 16, or 32 bit storage. These are all private inner classes that inherit from an
abstract outer class MarsagliaTsangWangDiscreteSampler. The public API is only factory methods
(no direct constructors):

public static MarsagliaTsangWangDiscreteSampler createProbabilityDistrbution(UniformRandomProvider
rng, double[] probabilities);
public static MarsagliaTsangWangDiscreteSampler createPoissonDistribution(UniformRandomProvider
rng, double mean);
public static MarsagliaTsangWangDiscreteSampler createBinomialDistribution(UniformRandomProvider
rng, int trials, double p);

There are no package private methods. Everything is private. So the implementation details
can be changed in the future if necessary.

Notes:
- It makes the class code very big. Separating to separate classes would mean exposing package-private
methods.
- The return type is a class (rather than the DiscreteSampler interface). This is mainly to
allow MarsagliaTsangWangDiscreteSampler to implement the new SharedStateSampler<T> interface
when it is added to the library.

I will push the new version to the current PR.


> 
> 
>> Gilles
>> 
>>> 
>>> Alex
>>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> [...]
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org <mailto:dev-unsubscribe@commons.apache.org>
>> For additional commands, e-mail: dev-help@commons.apache.org <mailto:dev-help@commons.apache.org>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message