[ https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341976#comment-16341976
]
Max Aller commented on MATH-1441:

I concur that constructing the TDistribution object isn't the bottleneck here – it's the
`inverseCumulativeProbability` calculation immediately afterwards that's expensive. So an effective
caching solution will also need to encompass that calculation. Though generating 1M+ TDistribution
objects, as in my test case, doesn't help either.
Benchmark-wise, after dropping the executions from 3M to 1M for my sanity's sake, the execution
time improved from 9.873s to 9.600s with caching – a modest ~3% improvement. So
I agree rolling back was the right call.
The right approach, IMO, is for there to be some subsystem/class for lazily calculating and
storing quasi-constants like these, and to use it throughout the codebase where it makes
sense. I might be interested in contributing something like that if you're interested –
what's the minimum JVM version you're targeting in Commons Math 4.0? I've also subscribed
to the Commons Developer ML.
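To make that concrete, here's a minimal sketch of the kind of subsystem I have in mind – the names are placeholders, not a proposed API. It's just a thread-safe memoizer keyed on whatever inputs determine the quasi-constant; for getSlopeConfidenceInterval the cached function would be something like `key -> new TDistribution(key.n - 2).inverseCumulativeProbability(1 - key.alpha / 2)`:

{code:java}
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: a thread-safe memoizer for expensive quasi-constants.
// computeIfAbsent (Java 8+) evaluates the function roughly once per key,
// so repeated calls cost only a hash lookup.
final class QuasiConstantCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> compute;

    QuasiConstantCache(Function<K, V> compute) {
        this.compute = compute;
    }

    V get(K key) {
        return cache.computeIfAbsent(key, compute);
    }
}

// Composite key for the confidence-interval case: (n, alpha) fully
// determines the critical value.
final class NAlphaKey {
    final long n;
    final double alpha;

    NAlphaKey(long n, double alpha) {
        this.n = n;
        this.alpha = alpha;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NAlphaKey)) {
            return false;
        }
        NAlphaKey k = (NAlphaKey) o;
        return n == k.n && Double.compare(alpha, k.alpha) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(n, alpha);
    }
}
{code}

Nothing there needs more than Java 8's computeIfAbsent – hence my question about the minimum JVM version.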
I'm also attaching, at the end, a screenshot of VisualVM's Sampler taken while my little
benchmark was running.
Oh, and lastly, thanks for looking at the issue and taking a stab at the solution! Always
nice, and sometimes a pleasant surprise, to work with people who care about their open-source
software.

!visualvm screenshot.png!
> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call
> 
>
> Key: MATH-1441
> URL: https://issues.apache.org/jira/browse/MATH-1441
> Project: Commons Math
> Issue Type: Improvement
> Affects Versions: 3.6.1
> Environment: Java 8, Linux x64.
> Reporter: Max Aller
> Priority: Minor
> Labels: performance
> Attachments: visualvm screenshot.png
>
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the order of 100k
or 1M times), is surprisingly slow – 3M calls, on my 3rd-gen i7 machine, take *75 seconds*.
This is primarily because recalculating the inverse cumulative probability, and reconstructing
the TDistribution object itself, is somewhat expensive, entailing a lot of `log` and `sqrt`
calls and iteration.
> The confidence interval is based on two values – *n* and *alpha*. I'd posit that alpha
will _usually_ be one of a very small set of values, and n – well, at least in my case, I'm
always calling this method with the same number of data points – a moving window of time-series
data. But I recognize n might be all over the place for some users.
> In any event, I strongly believe some level of caching would greatly benefit the users
of Commons Math. I've forked my own version of getSlopeConfidenceInterval that uses caching.
Here's a test case demonstrating that:
> {code:java}
> class SlowRegressionsTest {
>     @Test
>     void slow() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             simpleRegression.getSlopeConfidenceInterval();
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`slow` took %dms%n", duration);
>     }
>
>     @Test
>     void fast() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`fast` took %dms%n", duration);
>     }
> }{code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster – 53ns/call. My particular solution is written in Kotlin for Java
8, so not of direct relevance to you, but here it is:
> {code:java}
> package math
>
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
>
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
>     if (n < 3) {
>         return Double.NaN
>     }
>     if (alpha >= 1 || alpha <= 0) {
>         throw OutOfRangeException(
>             LocalizedFormats.SIGNIFICANCE_LEVEL,
>             alpha, 0, 1
>         )
>     }
>     // No advertised NotStrictlyPositiveException here - will return NaN above
>     // PATCH: use cached inverse cumulative probability
>     return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
>
> private val cache = ConcurrentHashMap<Key, Double>()
>
> private data class Key(val n: Long, val alpha: Double)
>
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
>     cache.getOrPut(Key(n, alpha)) {
>         TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
>     }
> {code}
> Limitations: 1. Kotlin, 2. ConcurrentHashMap is unbounded here.
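> Regarding limitation 2 – if the unbounded map is a concern, a bounded variant needs only
JDK classes. Purely for illustration (this is not an existing Commons Math API), an
access-ordered LinkedHashMap makes a small LRU cache:

{code:java}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: bounds the cache by evicting the least-recently-used
// entry once maxEntries is exceeded. synchronizedMap gives coarser locking
// than ConcurrentHashMap, but n and alpha rarely vary in practice, so
// contention should be low.
final class BoundedCache<K, V> {
    private final Map<K, V> map;

    BoundedCache(int maxEntries) {
        // Third constructor argument `true` = access order, i.e. LRU.
        this.map = Collections.synchronizedMap(
                new LinkedHashMap<K, V>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                        return size() > maxEntries;
                    }
                });
    }

    V computeIfAbsent(K key, Function<K, V> fn) {
        return map.computeIfAbsent(key, fn);
    }

    int size() {
        return map.size();
    }
}{code}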
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be handy here,
I believe. What are your thoughts?

This message was sent by Atlassian JIRA
(v7.6.3#76005)
