spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark
Date Wed, 10 Jun 2015 21:54:46 GMT
This email is good. Just one note -- a lot of people are swamped right
before Spark Summit, so you might not get prompt responses this week.


On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <grega@celtra.com> wrote:

> I have some time to work on it now. What's a good way to continue the
> discussions before coding it?
>
> This e-mail list, JIRA or something else?
>
> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rxin@databricks.com> wrote:
>
>> I think those are great to have. I would put them in the DataFrame API
>> though, since this is applying to structured data. Many of the advanced
>> functions on the PairRDDFunctions should really go into the DataFrame API
>> now we have it.
>>
>> One thing that would be great to understand is what state-of-the-art
>> alternatives are out there. I did a quick google scholar search using the
>> keyword "approximate quantile" and found some older papers. Just the
>> first few I found:
>>
>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>>
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>  by Bruce Lindsay, IBM
>>
>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>
>>
>>
>>
>>
>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <grega@celtra.com> wrote:
>>
>>> Hi!
>>>
>>> I'd like to get community's opinion on implementing a generic quantile
>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>> I would find it useful and I haven't found any existing implementation. The
>>> plan was basically to wrap t-digest
>>> <https://github.com/tdunning/t-digest>, implement the
>>> serialization/deserialization boilerplate and provide
>>>
>>> def cdf(x: Double): Double
>>> def quantile(q: Double): Double
>>>
>>>
>>> on RDD[Double] and RDD[(K, Double)].
>>>
>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>
>>> Best,
>>> Grega
>>> --
>>> [image: Inline image 1]*Grega Kešpret*
>>> Senior Software Engineer, Analytics
>>>
>>> Skype: gregakespret
>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>> <http://www.twitter.com/celtramobile>
>>>
>>>
>>
>

Mime
View raw message