commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Herbert <alex.d.herb...@gmail.com>
Subject Re: [rng] RNG-101 new MarsagliaTsangWang discrete probability sampler
Date Fri, 10 May 2019 14:24:36 GMT

On 10/05/2019 15:07, Gilles Sadowski wrote:
> Hi.
>
> Le ven. 10 mai 2019 à 15:53, Alex Herbert <alex.d.herbert@gmail.com> a écrit
:
>>
>> On 10/05/2019 14:27, Gilles Sadowski wrote:
>>> Hi Alex.
>>>
>>> Le ven. 10 mai 2019 à 13:57, Alex Herbert <alex.d.herbert@gmail.com> a
écrit :
>>>> Can I get a review of the PR for RNG-101 please.
>>> Thanks for this work!
>>>
>>> I didn't go into the details; however, I see many fields and methods like
>>>     table1 ... table5
>>>     fillTable1 ... fillTable5
>>>     getTable1 ... getTable5
>>> Wouldn't it be possible to use a 2D table:
>>>     table[5][];
>>> so that e.g. only one "fillTable(int tableIndex, /* other args */)" method
>>> is necessary (where "tableIndex" runs from 0 to 4)?
>> Yes. The design is based around using 5 tables as per the example code.
>>
>> The sample() method knows which table it needs so it can directly jump
>> to the table in question. I'd have to look at the difference in speed
>> when using a 2D table as you are adding another array access but
>> reducing the number of possible method calls (although you still need a
>> method call). Maybe this will be optimised out by the JVM.
>>
>> If the speed is not a factor then I'll rewrite it. Otherwise it's
>> probably better done for speed as this is the entire point of the
>> sampler given it disregards any probability under 2^-31 (i.e. it's not a
>> perfectly fair sampler).
>>
>> Note that 5 tables are needed for 5 hex digits (base 2^6). The paper
>> states using 3 tables of base 2^10 then you get a speed increase
>> (roughly 1.16x) at the cost of storage (roughly 9x). Changing to 2
>> tables of base 2^15 does not make it much faster again.
>>
>> I'll have a rethink to see if I can make the design work for different
>> base sizes.
> That could be an extension made easier with the 2D table, but
> I quite agree that given the relatively minor speed improvement
> to be expected, it is not the main reason; the rationale was just to
> make the code a more compact and a little easier to grasp (IMHO).
>
> Gilles

Direct from JMH:

Benchmark                             (randomSourceName)                       (samplerType)
 Mode  Cnt  Score   Error  Units
DiscreteSamplersPerformance.baseline                 N/A                                 N/A
 avgt    5  2.038 ± 0.015  ns/op
DiscreteSamplersPerformance.nextInt         SPLIT_MIX_64                                 N/A
 avgt    5  3.752 ± 1.708  ns/op
DiscreteSamplersPerformance.sample          SPLIT_MIX_64   MarsagliaTsangWangDiscreteSampler
 avgt    5  5.769 ± 0.005  ns/op
DiscreteSamplersPerformance.sample          SPLIT_MIX_64  MarsagliaTsangWangDiscreteSampler2
 avgt    5  5.954 ± 0.038  ns/op

Note: The sampler is very fast!

The baseline is just the timing for JMH to consume an int value. This 
does not even hit the generator. That is the nextInt() method.

So to get the time for the sampler we subtract the nextInt baseline (I 
used the median as one sample was slow):

5.752 - 3.559 = 2.193
5.954 - 3.559 = 2.395

It's about 10% slower using a 2D table.

I've only run this on the Discrete sampler. It is a bit artificial as 
the probability distribution is [0.1, 0.2, 0.3, 0.4] so the tables may 
not be very well filled. I'll repeat for the Poisson sampler as that has 
a long tail and larger tables.

I'm currently favouring leaving the code as it is. It matches the 
original source paper so anyone reading that will understand the code. 
Perhaps it needs more documentation.


>
>>> The diff for "DiscreteSamplersList.java" refers to
>>>      MarsagliaTsangWangBinomialSampler
>>> but
>>>     MarsagliaTsangWangSmallMeanPoissonSampler
>>> seems to be missing.
>> Oops, I missed adding that back. I built the PR from code where I was
>> testing lots of implementations.
>>
>> I've just added it back and it is still passing locally. Travis should
>> see that too as I pushed the change to the PR.
>>
>>> Regards,
>>> Gilles
>>>
>>>> This is a new sampler based on the source code from the paper:
>>>>
>>>> George Marsaglia, Wai Wan Tsang, Jingbo Wang (2004)
>>>> Fast Generation of Discrete Random Variables.
>>>> Journal of Statistical Software. Vol. 11, Issue. 3, pp. 1-11.
>>>>
>>>> https://www.jstatsoft.org/article/view/v011i03
>>>>
>>>> The code has no explicit licence.
>>>>
>>>> The paper states:
>>>>
>>>> "We have provided C versions of the two methods described here, for
>>>> inclusion in the “Browse
>>>> files”section of the journal. ... You may then want to examine the
>>>> components of the two files, for illumination
>>>> or for extracting portions that might be usefully applied to your
>>>> discrete distributions."
>>>>
>>>> So I assuming that it can be incorporated with little modification.
>>>>
>>>> The Java implementation has been rewritten to allow the storage to be
>>>> optimised for the required size. The generation of the tables has been
>>>> adapted appropriately and checks have been added on the input parameters
>>>> to ensure the sampler does not generate exceptions once constructed (I
>>>> found out the hard way that the original code was not entirely correct).
>>>>
>>>> Thanks.
>>>>
>>>> Alex
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message