commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex D Herbert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RNG-50) PoissonSampler single use speed improvements
Date Wed, 01 Aug 2018 21:37:00 GMT

    [ https://issues.apache.org/jira/browse/RNG-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566020#comment-16566020
] 

Alex D Herbert commented on RNG-50:
-----------------------------------

{quote}looks like it will silently recompute everything all the time if the user happened
to have mistyped its cache interval
{quote}
Well it won't compute more than what you would have to do if you didn't use a cache (it's
just the initialisation for the algorithm), so no gain but no real loss either.
{quote}if one creates a cache for values between 0 and 255, then if some caller requests 345
later, then keep that too in the cache, and so on
{quote}
An expandable cache requires an advanced data structure, e.g. a HashTable, in order to support
the full range of integers that could be requested. And the speed of a hash table look-up
within the cache could become expensive.

I think that the use of the cache is quite specialised so it is OK to leave it the user to
know what they are doing. It's all a bit mute anyway without knowing it will make a noticeable
difference on the speed. So I'll add a version as described above to my test project JMH benchmark
to see it is worth pursuing.

*Note*

With regard to the possible range for the mean I've noticed that the computation of {{log(n!)}}
does assume the double valued mean can be cast to an integer. This is also the same in the
original algorithm in Commons Math 3.6.1 (which used {{CombinatoricsUtils.factorialLog(int)}}).
Perhaps the algorithm should also check the mean is not above {{Integer.MAX_VALUE}}. This
would be a sensible limit because: (a) the sample() method is limited to returning an integer
and with a mean of Integer.MAX_VALUE half the sample would overflow; (b) a Poisson distribution
is approximated by a Gaussian at high mean so could be used a substitute anyway. Commons Math
3.6.1 even has {{PoissonDistribution.normalApproximateProbability(int)}}. Actually support
for this is done by creating a {{NormalDistribution}} each time a {{PoissonDistribution}}
is created which is expensive if it's never used. I've made a copy implementation that avoids
this for my simulations. However the new Commons RNG {{PoissonSampler}} will be much better
and I'll be switching code over to using that instead when the resolution for this ticket
is released.

 

> PoissonSampler single use speed improvements
> --------------------------------------------
>
>                 Key: RNG-50
>                 URL: https://issues.apache.org/jira/browse/RNG-50
>             Project: Commons RNG
>          Issue Type: Improvement
>    Affects Versions: 1.0
>            Reporter: Alex D Herbert
>            Priority: Minor
>         Attachments: PoissonSamplerTest.java, jmh-result.csv
>
>
> The Sampler architecture of {{org.apache.commons.rng.sampling.distribution}} is nicely
written for fast sampling of small dataset sizes. The constructors for the samplers do not
check the input parameters are valid for the respective distributions (in contrast to the
old {{org.apache.commons.math3.random.distribution}} classes). I assume this is a design choice
for speed. Thus most of the samplers can be used within a loop to sample just one value with
very little overhead.
> The {{PoissonSampler}} precomputes log factorial numbers upon construction if the mean
is above 40. This is done using the {{InternalUtils.FactorialLog}} class. As of version 1.0
this internal class is currently only used in the {{PoissonSampler}}.
> The cache size is limited to 2*PIVOT (where PIVOT=40). But it creates and precomputes
the cache every time a PoissonSampler is constructed if the mean is above the PIVOT value.
> Why not create this once in a static block for the PoissonSampler?
> {code:java}
> /** {@code log(n!)}. */
> private static final FactorialLog factorialLog;
>      
> static 
> {
>     factorialLog = FactorialLog.create().withCache((int) (2 * PoissonSampler.PIVOT));
> }
> {code}
> This will make the construction cost of a new {{PoissonSampler}} negligible. If the table
is computed dynamically as a static construction method then the overhead will be in the first
use. Thus the following call will be much faster:
> {code:java}
> UniformRandomProvider rng = ...;
> int value = new PoissonSampler(rng, 50).sample();
> {code}
> I have tested this modification (see attached file) and the results are:
> {noformat}
> Mean 40  Single construction ( 7330792) vs Loop construction                        
 (24334724)   (3.319522.2x faster)
> Mean 40  Single construction ( 7330792) vs Loop construction with static FactorialLog
( 7990656)   (1.090013.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction                        
 (19389026)   (3.034132.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction with static FactorialLog
( 6146556)   (0.961857.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction                        
 (21337678)   (3.532047.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction with static FactorialLog
( 5329129)   (0.882136.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction                        
 (23963516)   (3.951765.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction with static FactorialLog
( 5306081)   (0.875013.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction                        
 (26381365)   (4.349935.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction with static FactorialLog
( 6341274)   (1.045591.2x faster)
> {noformat}
> Thus the speed improvements would be approximately 3-4 fold for single use Poisson sampling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message