spark-dev mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Loose the requirement of "median" of the SQL metrics
Date Wed, 27 Nov 2019 10:20:27 GMT
Another option could be to use a sketch to get an approximate median
(extendable to quantiles as well). When the number of tasks is small, the
sketch would give the exact value since it sees every element; for a larger
number of tasks, the memory and compute savings would be significant.
Regards,
Mayur Rustagi
Ph: +1 (650) 937 9673
http://www.sigmoid.com <http://www.sigmoidanalytics.com/>
@mayur_rustagi <http://www.twitter.com/mayur_rustagi>
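As a concrete illustration of the sketch idea, here is a minimal stand-in: a fixed-size uniform reservoir sample. A real implementation would use a proper quantile sketch (e.g. t-digest or a KLL/Greenwald-Khanna sketch); the class and parameter names below are hypothetical, not Spark code.

```python
import random


class SampleMedian:
    """Crude stand-in for a quantile sketch: a fixed-size uniform
    reservoir sample. With few tasks the reservoir holds every value,
    so the median is exact; with many tasks it yields an approximate
    median in constant memory."""

    def __init__(self, capacity=100, seed=42):
        self._rng = random.Random(seed)
        self._sample = []
        self._seen = 0
        self._capacity = capacity

    def add(self, value):
        """Reservoir sampling: keep a uniform sample of all values seen."""
        self._seen += 1
        if len(self._sample) < self._capacity:
            self._sample.append(value)
        else:
            j = self._rng.randrange(self._seen)
            if j < self._capacity:
                self._sample[j] = value

    def median(self):
        """Upper median of the current sample."""
        s = sorted(self._sample)
        return s[len(s) // 2]
```

With fewer values than the capacity, every task metric is retained and the result is the exact median; beyond that, accuracy degrades gracefully while memory stays constant.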


On Wed, Nov 27, 2019 at 3:25 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Hi Spark devs,
>
> The change might be specific to SQLAppStatusListener, but since it may
> change the metric values shown in the UI, I would like to hear some
> voices on this.
>
> When we aggregate a SQL metric across tasks, we apply "sum", "min",
> "median", and "max", all of which are cumulative except "median". Median
> differs from "average" in that it helps to get rid of outliers, but if
> that's its only purpose, we may not strictly need the exact median value.
>
> I'm not sure how much such an approximation would weaken the metric's
> representativeness, but if it doesn't hurt much, what about taking a
> median of medians? For example, take the median of each group of 10
> adjacent tasks, store it as one of the intermediate medians, and finally
> take the median of those medians. If I calculate correctly, with 100
> tasks that would require only 11% of the slots (11 instead of 100), and
> would replace one sort of 100 elements with 11 sorts of 10 elements. The
> savings would grow with the number of tasks.
>
> Just a rough idea, so any feedback is appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
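The median-of-medians scheme described above can be sketched as follows (group size 10 as in the example; the function names are illustrative, not the actual SQLAppStatusListener code):

```python
def exact_median(values):
    """Upper median via a full sort, as a simple sort-and-index would do."""
    s = sorted(values)
    return s[len(s) // 2]


def approx_median(values, group_size=10):
    """Median of per-group medians: sorts ceil(n/g) groups of g elements
    plus one list of group medians, instead of one sort of n elements."""
    groups = [values[i:i + group_size]
              for i in range(0, len(values), group_size)]
    return exact_median([exact_median(g) for g in groups])


metrics = list(range(1, 101))   # e.g. 100 per-task metric values
print(exact_median(metrics))    # 51
print(approx_median(metrics))   # 56 -- approximate; sorted input shows the bias
```

Note the trade-off visible even in this toy case: the approximate value can drift from the exact median (56 vs. 51 here), so the question raised in the mail is whether that drift matters for a UI metric whose purpose is mainly outlier resistance.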
