spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <so...@cloudera.com>
Subject Re: statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala
Date Mon, 03 Oct 2016 08:39:02 GMT
+Matei for question about the source of this bit of code

That's a good question; I remember wondering about this once upon a time.

First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
code at this point. GroupedCountEvaluator is still used.

MeanEvaluator is a better example, because it's straightforward. It's
getting a confidence interval on the true mean from the sample stats
using a t-distribution.

CountEvaluator however I don't quite get ...

val p = outputsMerged.toDouble / totalOutputs
val mean = (sum + 1 - p) / p
val variance = (sum + 1) * (1 - p) / (p * p)
val stdev = math.sqrt(variance)
val confFactor = new NormalDistribution().
  inverseCumulativeProbability(1 - (1 - confidence) / 2)
val low = mean - confFactor * stdev
val high = mean + confFactor * stdev
new BoundedDouble(mean, confidence, low, high)

Given the mean/variance formula, this looks like it's modeling the
rest of the count not yet observed as negative binomial. However I'd
expect the mean to be something like sum * (1-p) / p instead, and
that's not the mean of the total count but the rest of the count then.
This also doesn't truncate the interval at 0 (count can't be
negative), which could also be solved by not using a normal
approximation too.

Of course, I could be totally wrong about the model then.

This is pretty old code, from years ago. Matei you at least merged it
though may not have written it -- do you have any more info?

I might start writing some tests for this since the result isn't
directly tested in the unit tests, to see how it holds up.

Then, the question is SumEvaluator, which seems to approach the
estimation as an estimation of product of count and mean together, so
it has some related questions.

Sean

On Sun, Oct 2, 2016 at 8:47 PM, philipghu <philguangyan@gmail.com> wrote:
> Hi,
>
>
> I've been struggling to understand the statistical theory behind this piece
> of code (from
> /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
> below, especially with respect  to estimating the  size of the population
> (total tasks) and its variance. Also I'm trying to understand how the
> variance  of the sum is calculated like that. I'm struggling to find the
> source online too.
>
>  while (iter.hasNext) {
>         val entry = iter.next()
>         val counter = entry.getValue
>         val meanEstimate = counter.mean
>         val meanVar = counter.sampleVariance / counter.count
>         val countEstimate = (counter.count + 1 - p) / p
>         val countVar = (counter.count + 1) * (1 - p) / (p * p)
>         val sumEstimate = meanEstimate * countEstimate
>         val sumVar = (meanEstimate * meanEstimate * countVar) +
>                      (countEstimate * countEstimate * meanVar) +
>                      (meanVar * countVar)
>         val sumStdev = math.sqrt(sumVar)
>         val confFactor = studentTCacher.get(counter.count)
>         val low = sumEstimate - confFactor * sumStdev
>         val high = sumEstimate + confFactor * sumStdev
>         result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
> low, high))
>       }
>
> Thanks and best regards,
> Phil
>
>
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message