spark-dev mailing list archives

From Grega Kešpret
Subject Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
Date Mon, 06 Apr 2015 07:50:24 GMT

I'd like to get the community's opinion on implementing a generic quantile
approximation algorithm for Spark that is O(n) and requires limited memory.
I would find it useful, and I haven't found any existing implementation. The
plan is basically to wrap t-digest, implement the
serialization/deserialization boilerplate, and provide

def cdf(x: Double): Double
def quantile(q: Double): Double

on RDD[Double] and RDD[(K, Double)].
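
To make the shape of such an API concrete, here is a minimal sketch in plain Scala. It is not t-digest and not Spark code: `HistSketch` is a hypothetical stand-in (a fixed-range, equal-width histogram), and the nested `Seq` only simulates RDD partitions. What it illustrates is the property a distributed version relies on: each partition builds a small summary in one pass, summaries merge associatively, and `cdf`/`quantile` are answered from the merged summary.

```scala
// Hypothetical stand-in for a t-digest wrapper: an equal-width histogram
// over a range [lo, hi] that is assumed to be known in advance.
// Values outside the range are clamped into the first or last bin.
final case class HistSketch(lo: Double, hi: Double, counts: Vector[Long]) {
  private val width: Double = (hi - lo) / counts.length

  def total: Long = counts.sum

  // Add a single observation (one pass over the data, constant memory).
  def add(x: Double): HistSketch = {
    val i = math.min(counts.length - 1, math.max(0, ((x - lo) / width).toInt))
    copy(counts = counts.updated(i, counts(i) + 1))
  }

  // Merge two summaries built on different partitions.
  def merge(other: HistSketch): HistSketch = {
    require(lo == other.lo && hi == other.hi && counts.length == other.counts.length,
      "sketches must share the same range and bin count")
    copy(counts = counts.zip(other.counts).map { case (a, b) => a + b })
  }

  // Approximate fraction of observations <= x, interpolating inside a bin.
  def cdf(x: Double): Double = {
    if (total == 0) Double.NaN
    else if (x <= lo) 0.0
    else if (x >= hi) 1.0
    else {
      val i = math.min(counts.length - 1, ((x - lo) / width).toInt)
      val below = counts.take(i).sum.toDouble
      val frac = (x - (lo + i * width)) / width
      (below + frac * counts(i)) / total
    }
  }

  // Approximate value at quantile q in [0, 1], interpolating inside a bin.
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0, "q must be in [0, 1]")
    if (total == 0) return Double.NaN
    val target = q * total
    var cum = 0.0
    var i = 0
    while (i < counts.length) {
      val c = counts(i)
      if (c > 0 && cum + c >= target) {
        return lo + (i + (target - cum) / c) * width
      }
      cum += c
      i += 1
    }
    hi
  }
}

object HistSketch {
  def empty(lo: Double, hi: Double, bins: Int): HistSketch =
    HistSketch(lo, hi, Vector.fill(bins)(0L))
}

object QuantileSketchDemo {
  // Four simulated "partitions" holding the values 1.0 to 100.0.
  val partitions: Seq[Seq[Double]] =
    (1 to 100).map(_.toDouble).grouped(25).map(_.toSeq).toSeq

  val zero: HistSketch = HistSketch.empty(0.0, 100.0, 100)

  // Per-partition fold, then a merge of the partial summaries.
  val merged: HistSketch =
    partitions.map(_.foldLeft(zero)((s, x) => s.add(x))).reduce((a, b) => a.merge(b))

  def main(args: Array[String]): Unit = {
    println(s"median ~ ${merged.quantile(0.5)}, cdf(50.0) ~ ${merged.cdf(50.0)}")
  }
}
```

The fold-then-merge pattern has the same shape as `RDD.aggregate(zeroValue)(seqOp, combOp)`, so on a real `RDD[Double]` the two lambdas above would presumably serve as `seqOp` and `combOp`. A t-digest-backed version would swap the histogram for the digest, which does not need the value range up front and is more accurate in the tails.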

Let me know what you think. Any other ideas or suggestions are also welcome!

Grega Kešpret
Senior Software Engineer, Analytics

Skype: gregakespret | @celtramobile
