spark-dev mailing list archives

From Grega Kešpret <gr...@celtra.com>
Subject Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
Date Mon, 06 Apr 2015 07:50:24 GMT
Hi!

I'd like to get the community's opinion on implementing a generic quantile
approximation algorithm for Spark that runs in O(n) time and requires limited
memory. I would find it useful, and I haven't found any existing
implementation. The plan is basically to wrap t-digest
<https://github.com/tdunning/t-digest>, implement the
serialization/deserialization boilerplate, and provide

def cdf(x: Double): Double
def quantile(q: Double): Double


on RDD[Double] and RDD[(K, Double)].
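
To make the shape of that API a bit more concrete, here is a minimal,
dependency-free sketch. Everything in it is an assumption for illustration:
the names Summary, ApproxStats, summarize and summarizeByKey are hypothetical,
a Seq of Seqs stands in for an RDD and its partitions, and an exact sorted
sample stands in for t-digest (so it is neither approximate nor
memory-bounded). The only point is that a summary which supports "add one
value" and "merge two summaries" can be built per partition and then
combined, after which cdf and quantile are answered from the merged summary.

```scala
// Stand-in for a t-digest: keeps every value, sorted. Not memory-bounded.
final case class Summary(sorted: Vector[Double] = Vector.empty) {
  // Fold one value into the summary (the per-partition step).
  def add(x: Double): Summary = Summary((sorted :+ x).sorted)

  // Combine two summaries (the cross-partition step).
  def merge(other: Summary): Summary = Summary((sorted ++ other.sorted).sorted)

  // Fraction of observed values that are <= x; NaN if nothing was observed.
  def cdf(x: Double): Double =
    if (sorted.isEmpty) Double.NaN
    else sorted.count(_ <= x).toDouble / sorted.size

  // Nearest-rank quantile: the value at rank ceil(q * n), clamped to rank 1.
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0, "q must be in [0, 1]")
    if (sorted.isEmpty) Double.NaN
    else {
      val rank = math.max(math.ceil(q * sorted.size).toInt, 1)
      sorted(rank - 1)
    }
  }
}

object ApproxStats {
  // Stand-in for RDD[Double]: each inner Seq plays the role of one partition.
  def summarize(partitions: Seq[Seq[Double]]): Summary =
    partitions
      .map(p => p.foldLeft(Summary())((s, x) => s.add(x))) // within a partition
      .foldLeft(Summary())((a, b) => a.merge(b))           // across partitions

  // Stand-in for RDD[(K, Double)]: one summary per key.
  def summarizeByKey[K](partitions: Seq[Seq[(K, Double)]]): Map[K, Summary] =
    partitions.flatten.foldLeft(Map.empty[K, Summary]) { case (acc, (k, x)) =>
      acc.updated(k, acc.getOrElse(k, Summary()).add(x))
    }
}
```

With the real thing, I would expect the two steps to map onto Spark's
aggregate-style operations (for example aggregate on the RDD and
aggregateByKey on the pair RDD), with a t-digest as the accumulator instead
of the sorted vector.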

Let me know what you think. Any other ideas or suggestions are also welcome!

Best,
Grega
--
Grega Kešpret
Senior Software Engineer, Analytics

Skype: gregakespret
celtra.com <http://www.celtra.com/> | @celtramobile
<http://www.twitter.com/celtramobile>
