spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean R. Owen (Jira)" <>
Subject [jira] [Assigned] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
Date Tue, 08 Oct 2019 13:13:00 GMT


Sean R. Owen reassigned SPARK-29336:

    Assignee: Guilherme Souza

> The implementation of QuantileSummaries.merge  does not guarantee that the relativeError
will be respected 
> -----------------------------------------------------------------------------------------------------------
>                 Key: SPARK-29336
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Guilherme Souza
>            Assignee: Guilherme Souza
>            Priority: Minor
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient quantile algorithm|]
in another language and I was using the Spark's one as a reference.
> In my analysis, I believe to have found an issue with the {{merge()}} logic. Here is
some simple Scala code that reproduces the issue I've found:
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = => (i+1).toDouble / values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = => values.indexOf(ans)).toArray
>   val error ={ case (answer, expected) => Math.abs(expected
- answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% max error.
In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of
100). However, the output I observe is:
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the scenes, but irrelevant
to the core cause. (and sorry for my Scala, I'm not used to it)
> Interestingly enough, if I change from five to one partition the code works as expected
and gives 10 every time. This seems to point to some problem at the [merge logic|]
> The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history)
suggest the published paper is not clear on how that should be done and, honestly, I was not
confident in the current approach either.
> I've found SPARK-21184 that reports the same problem, but it was unfortunately closed
with no fix applied.
> In my external implementation I believe to have found a sound way to implement the merge
method. [Here is my take in Rust, if relevant|]
> I'd be really glad to add unit tests and contribute my implementation adapted to Scala.
>  I'd love to hear your opinion on the matter.
> Best regards

This message was sent by Atlassian Jira

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message