[ https://issues.apache.org/jira/browse/RNG-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568231#comment-16568231 ]
Alex D Herbert edited comment on RNG-50 at 8/3/18 9:55 PM:

*Note: I've edited this as the original post used a random seed for each benchmark and the
timings were inconsistent. Now the results show timings using the same seed; the cache is
faster.*
=====
I've created a benchmark for a cache for the state of the {{LargeMeanPoissonSampler}}. The
implementing classes are all package-private, so the details are hidden from the user.
The API is a slight change from above: first create the cache with the expected range of the
mean, then use the cache to get a {{DiscreteSampler}}:
{code:java}
double minMean = 20;
double maxMean = 256;
PoissonSamplerCache cache = new PoissonSamplerCache(minMean, maxMean);
UniformRandomProvider rng = RandomSource.create(RandomSource.SPLIT_MIX_64);
for (int i = 0; i < 1000; i++) {
    double mean = minMean + rng.nextDouble() * (maxMean - minMean);
    DiscreteSampler s1 = cache.getPoissonSampler(rng, mean);
    // This is equivalent to:
    DiscreteSampler s2 = new PoissonSampler(rng, mean);
}
{code}
The cache should provide a sampler that matches what the {{PoissonSampler}} constructor
would produce, i.e. the same output as not using a cache.
I've tested this with a small and a medium range using 10 benchmark iterations. I've used an
{{AtomicReferenceArray}} for a thread-safe version and a plain array for comparison. Here
are the relative performance figures:
||Source||Range||Name||Relative Score||
|SPLIT_MIX_64|256|NoCache|1|
|SPLIT_MIX_64|256|SyncCache|0.521317205515385|
|SPLIT_MIX_64|256|Cache|0.437231152662361|
|SPLIT_MIX_64|1024|NoCache|1|
|SPLIT_MIX_64|1024|SyncCache|0.450575830688281|
|SPLIT_MIX_64|1024|Cache|0.399218435578451|
|WELL_19937_C|256|NoCache|1|
|WELL_19937_C|256|SyncCache|0.66196477851107|
|WELL_19937_C|256|Cache|0.632173341651003|
|WELL_19937_C|1024|NoCache|1|
|WELL_19937_C|1024|Cache|0.650754710270584|
|WELL_19937_C|1024|SyncCache|0.630520058831807|
|WELL_44497_B|256|NoCache|1|
|WELL_44497_B|256|SyncCache|0.736007125853911|
|WELL_44497_B|256|Cache|0.731063880702168|
|WELL_44497_B|1024|NoCache|1|
|WELL_44497_B|1024|SyncCache|0.65585392035459|
|WELL_44497_B|1024|Cache|0.609014616171805|
Observations:
* The cache is faster; roughly twice as fast for the SPLIT_MIX_64 RNG.
* When the RNG is slower (SPLIT_MIX_64 < WELL_19937_C < WELL_44497_B) the speed-up is
reduced. This is expected given that initialisation is a smaller part of the algorithm run
time.
* The plain (unsynchronised) version is faster than the synchronised version, but not by much.
I'll need to repeat this with more iterations, different ranges and a totally idle computer,
but it shows the cache may be worth adding to the library.
The difference between the plain and synchronised versions is small, so there's not much point
in having a non-synchronised version; keeping only the thread-safe version makes the cache more
flexible for use in multi-threaded simulations.
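The thread-safe behaviour relies on {{AtomicReferenceArray.compareAndSet}}: racing threads may compute the same state twice, but only the first write sticks and no locks are needed. A minimal, self-contained sketch of the pattern (hypothetical names, not the library code):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/**
 * Illustrative sketch (hypothetical class): a lock-free memo table
 * using compareAndSet, the same idiom as the sampler state cache.
 */
class LazyTable {
    private final AtomicReferenceArray<Double> values;
    private final int offset;

    LazyTable(int min, int max) {
        this.offset = min;
        this.values = new AtomicReferenceArray<>(max - min + 1);
    }

    /** Returns log(n), computing and caching it on first use. */
    double get(int n) {
        final int index = n - offset;
        Double state = values.get(index);
        if (state == null) {
            // Duplicate computation by racing threads is harmless:
            // the value is deterministic, so the first writer wins.
            state = Math.log(n);
            values.compareAndSet(index, null, state);
            state = values.get(index);
        }
        return state;
    }
}
```

Because the cached value is deterministic, a lost compare-and-set race costs only a wasted computation, never an incorrect result.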
Here is the current implementation:
{code:java}
package org.apache.commons.rng.sampling.distribution;

import java.util.concurrent.atomic.AtomicReferenceArray;

import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.sampling.distribution.LargeMeanPoissonSampler.LargeMeanPoissonSamplerState;

/**
 * Create a sampler for the
 * <a href="http://mathworld.wolfram.com/PoissonDistribution.html">Poisson
 * distribution</a> using a cache to minimise construction cost.
 *
 * <p>The cache will return a sampler equivalent to
 * {@link PoissonSampler#PoissonSampler(UniformRandomProvider, double)}.
 *
 * <p>The cache allows the {@link PoissonSampler} construction cost to be minimised
 * for small sample sizes. The cache is advantageous under the following
 * conditions:
 *
 * <ul>
 *  <li>The mean of the Poisson distribution falls within a known range.</li>
 *  <li>The sample size to be made with the <strong>same</strong> sampler is
 *      small.</li>
 * </ul>
 *
 * <p>If the sample size to be made with the <strong>same</strong> sampler is large
 * then the construction cost is minimal compared to the sampling time.
 *
 * <p>Performance improvement is dependent on the speed of the
 * {@link UniformRandomProvider}. A fast provider can obtain a two-fold speed
 * improvement for a single-use Poisson sampler.
 *
 * <p>The cache is thread safe.
 */
public class PoissonSamplerCache {
    /**
     * The minimum N covered by the cache where
     * {@code N = (int) Math.floor(mean)}.
     */
    private final int minN;
    /**
     * The maximum N covered by the cache where
     * {@code N = (int) Math.floor(mean)}.
     */
    private final int maxN;
    /** The cache of states between {@code minN} and {@code maxN}. */
    private final AtomicReferenceArray<LargeMeanPoissonSamplerState> values;

    /**
     * @param minMean The minimum mean covered by the cache.
     * @param maxMean The maximum mean covered by the cache.
     * @throws IllegalArgumentException if {@code maxMean < minMean}
     */
    public PoissonSamplerCache(double minMean, double maxMean) {
        // Although a mean of 0 is invalid for a Poisson sampler this case
        // is handled to make the cache user friendly. Any low means will
        // be handled by the SmallMeanPoissonSampler and not cached.
        if (minMean < 0) {
            minMean = 0;
        }
        // Allow minMean == maxMean so that the cache can be used across
        // concurrent threads to create samplers with distinct RNGs and the
        // same mean.
        if (maxMean < minMean) {
            throw new IllegalArgumentException(
                "Max mean: " + maxMean + " < " + minMean);
        }
        // The cache can only be used for the LargeMeanPoissonSampler.
        if (maxMean < WrapperPoissonSampler.PIVOT) {
            // The upper limit is too small so no cache will be used.
            // This class will just construct new samplers.
            minN = 0;
            maxN = 0;
            values = null;
        } else {
            // Convert the mean into integers.
            // Note the minimum is clipped to the algorithm switch point.
            this.minN = (int) Math.floor(Math.max(minMean, WrapperPoissonSampler.PIVOT));
            this.maxN = (int) Math.floor(maxMean);
            values = new AtomicReferenceArray<>(maxN - minN + 1);
        }
    }

    /**
     * Creates a Poisson sampler. The returned sampler will function exactly the
     * same as {@link PoissonSampler#PoissonSampler(UniformRandomProvider, double)}.
     *
     * <p>A value of {@code mean} outside the range of the cache is valid.
     *
     * @param rng Generator of uniformly distributed random numbers.
     * @param mean Mean.
     * @return A Poisson sampler.
     * @throws IllegalArgumentException if {@code mean <= 0}.
     */
    public DiscreteSampler getPoissonSampler(UniformRandomProvider rng,
                                             double mean) {
        // Ensure the same functionality as the PoissonSampler by
        // using a SmallMeanPoissonSampler under the switch point.
        if (mean < WrapperPoissonSampler.PIVOT) {
            return new SmallMeanPoissonSampler(rng, mean);
        }
        // Convert the mean into an integer.
        final int n = (int) Math.floor(mean);
        if (n > maxN) {
            // Outside the range of the cache.
            return new LargeMeanPoissonSampler(rng, mean);
        }
        // Look in the cache for a state that can be reused.
        // Note: The cache is offset by minN.
        final int index = n - minN;
        LargeMeanPoissonSamplerState state = values.get(index);
        if (state == null) {
            // Compute and store for reuse.
            state = LargeMeanPoissonSamplerState.create(n);
            // Only set this once as it will be the same.
            // Allows concurrent threads to compute the state without
            // excess synchronisation once the state is stored.
            values.compareAndSet(index, null, state);
        }
        // Compute the remaining fraction of the mean.
        final double lambdaFractional = mean - n;
        return new LargeMeanPoissonSampler(rng, state, lambdaFractional);
    }
}
{code}
I've left out the implementation of the {{LargeMeanPoissonSamplerState}}. Suffice it to say
that the {{LargeMeanPoissonSampler}} computes it and it is a nested class. This keeps all the
functionality of the {{LargeMeanPoissonSampler}} in one place.
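The key idea is that the expensive part of construction is captured in an immutable state object keyed only on {{n}}, so one instance can be shared safely by any number of samplers and threads. An illustrative sketch of that pattern, with hypothetical names (not the library class):

```java
/**
 * Illustrative sketch (hypothetical names): an immutable, precomputed
 * state shared by many sampler instances. Immutability is what makes
 * sharing across threads safe without synchronisation.
 */
final class ExpensiveState {
    /** Precomputed log(n!), standing in for the costly part of construction. */
    private final double logFactorialN;
    /** The integer part of the mean this state was computed for. */
    private final int n;

    private ExpensiveState(int n, double logFactorialN) {
        this.n = n;
        this.logFactorialN = logFactorialN;
    }

    /** Expensive one-off computation, done once per n and then shared. */
    static ExpensiveState create(int n) {
        double sum = 0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return new ExpensiveState(n, sum);
    }

    int n() {
        return n;
    }

    double logFactorialN() {
        return logFactorialN;
    }
}
```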
Q. Should I pursue this addition?
was (Author: alexherbert):
*Note: I've edited this as the original post used a random seed for each benchmark. With the
same seed the plain version is faster.*
> PoissonSampler single-use speed improvements
> --------------------------------------------
>
> Key: RNG-50
> URL: https://issues.apache.org/jira/browse/RNG-50
> Project: Commons RNG
> Issue Type: Improvement
> Affects Versions: 1.0
> Reporter: Alex D Herbert
> Priority: Minor
> Attachments: PoissonSamplerTest.java, jmh-result.csv
>
>
> The Sampler architecture of {{org.apache.commons.rng.sampling.distribution}} is nicely
written for fast sampling of small dataset sizes. The constructors for the samplers do not
check that the input parameters are valid for the respective distributions (in contrast to the
old {{org.apache.commons.math3.random.distribution}} classes). I assume this is a design choice
for speed. Thus most of the samplers can be used within a loop to sample just one value with
very little overhead.
> The {{PoissonSampler}} precomputes log factorial numbers upon construction if the mean
is above 40. This is done using the {{InternalUtils.FactorialLog}} class. As of version 1.0
this internal class is currently only used in the {{PoissonSampler}}.
> The cache size is limited to 2*PIVOT (where PIVOT=40), but the cache is created and
precomputed every time a {{PoissonSampler}} is constructed if the mean is above the PIVOT value.
> Why not create this once in a static block for the {{PoissonSampler}}?
> {code:java}
>     /** {@code log(n!)}. */
>     private static final FactorialLog factorialLog;
>
>     static {
>         factorialLog = FactorialLog.create().withCache((int) (2 * PoissonSampler.PIVOT));
>     }
> {code}
> This will make the construction cost of a new {{PoissonSampler}} negligible. If the table
is instead computed lazily by a static construction method then the overhead will fall on the
first use. Thus the following call will be much faster:
> {code:java}
> UniformRandomProvider rng = ...;
> int value = new PoissonSampler(rng, 50).sample();
> {code}
> I have tested this modification (see attached file) and the results are:
> {noformat}
> Mean 40 Single construction ( 7330792) vs Loop construction                          (24334724) (3.319522x faster)
> Mean 40 Single construction ( 7330792) vs Loop construction with static FactorialLog ( 7990656) (1.090013x faster)
> Mean 50 Single construction ( 6390303) vs Loop construction                          (19389026) (3.034132x faster)
> Mean 50 Single construction ( 6390303) vs Loop construction with static FactorialLog ( 6146556) (0.961857x faster)
> Mean 60 Single construction ( 6041165) vs Loop construction                          (21337678) (3.532047x faster)
> Mean 60 Single construction ( 6041165) vs Loop construction with static FactorialLog ( 5329129) (0.882136x faster)
> Mean 70 Single construction ( 6064003) vs Loop construction                          (23963516) (3.951765x faster)
> Mean 70 Single construction ( 6064003) vs Loop construction with static FactorialLog ( 5306081) (0.875013x faster)
> Mean 80 Single construction ( 6064772) vs Loop construction                          (26381365) (4.349935x faster)
> Mean 80 Single construction ( 6064772) vs Loop construction with static FactorialLog ( 6341274) (1.045591x faster)
> {noformat}
> Thus the speed improvements would be approximately 3-4 fold for single-use Poisson sampling.
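The static-table proposal above can be sketched with a self-contained example (a hypothetical class, not {{InternalUtils.FactorialLog}}): the recurrence log(n!) = log((n-1)!) + log(n) fills the whole table in one pass, paid once at class load.

```java
/**
 * Illustrative sketch (hypothetical, not InternalUtils.FactorialLog):
 * a statically precomputed log(n!) table. The static initialiser runs
 * once at class load, so every sampler construction afterwards pays
 * only an array lookup.
 */
class LogFactorial {
    /** Table covering 0 <= n <= 2 * PIVOT, assuming PIVOT = 40 as in the text. */
    private static final double[] LOG_FACTORIAL = new double[81];

    static {
        // log(0!) = log(1!) = 0; build the rest with
        // log(n!) = log((n - 1)!) + log(n).
        for (int n = 2; n < LOG_FACTORIAL.length; n++) {
            LOG_FACTORIAL[n] = LOG_FACTORIAL[n - 1] + Math.log(n);
        }
    }

    /** Returns log(n!) for 0 <= n <= 80. */
    static double value(int n) {
        return LOG_FACTORIAL[n];
    }
}
```

This trades a small fixed memory footprint (81 doubles) for removing the per-construction precompute entirely.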

This message was sent by Atlassian JIRA
(v7.6.3#76005)
