How big is the overhead, at scale?
If it has a nontrivial effect for most jobs, I could imagine reusing
the existing approximate quantile support to more efficiently find a
pretty-close median.
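
For example, something along the lines of the internal Greenwald-Khanna
sketch (org.apache.spark.sql.catalyst.util.QuantileSummaries) could track
an approximate median in one pass. A rough sketch only, where taskValues
stands in for the per-task metric values; it's an internal API, so the
exact signature may vary by version:

  import org.apache.spark.sql.catalyst.util.QuantileSummaries

  // One pass over the values; memory is bounded by the sketch size,
  // not by the number of tasks.
  var summary = new QuantileSummaries(
    QuantileSummaries.defaultCompressThreshold,
    QuantileSummaries.defaultRelativeError)
  taskValues.foreach { v => summary = summary.insert(v.toDouble) }
  val approxMedian = summary.compress().query(0.5)  // Option[Double]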
On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim
<kabhwan.opensource@gmail.com> wrote:
>
> Hi Spark devs,
>
> The change might be specific to SQLAppStatusListener, but since it may change the metric
> values shown in the UI, I would like to hear some voices on this.
>
> When we aggregate the SQL metrics between tasks, we apply "sum", "min", "median", and
> "max", all of which can be computed incrementally except "median". Median differs from
> "average" in that it helps to get rid of outliers, but if that's the only purpose, we may
> not strictly need the exact value of the median.
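>
> For example (a sketch only, not the listener code, with taskValues standing in for the
> per-task metric values), the first three fold with O(1) state per metric, while an exact
> median needs every value:
>
>   val sum = taskValues.foldLeft(0L)(_ + _)  // constant state
>   val min = taskValues.min                  // constant state
>   val max = taskValues.max                  // constant state
>   val med = { val s = taskValues.sorted; s(s.length / 2) }  // keeps all values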
>
> I'm not sure how much this would dilute the meaning of the reported value, but if it
> doesn't hurt much, what about taking a median of medians? For example, take the median of
> each consecutive group of 10 tasks, store it as one of the medians, and finally take the
> median of those medians. If I calculate correctly, that would only require 11% of the
> slots when the number of tasks is 100, and it would replace one sort of 100 elements with
> 11 sorts of 10 elements. The difference would be bigger as the number of tasks grows.
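>
> A rough sketch of the idea (illustrative only, not actual SQLAppStatusListener code;
> the chunk size of 10 matches the example above):
>
>   def medianOfMedians(values: Array[Long], chunkSize: Int = 10): Long = {
>     def exactMedian(xs: Array[Long]): Long = {
>       val sorted = xs.sorted          // sort one small chunk
>       sorted(sorted.length / 2)
>     }
>     // 100 values -> ten sorts of 10 elements, plus one sort of the 10 medians
>     val medians = values.grouped(chunkSize).map(exactMedian).toArray
>     exactMedian(medians)
>   }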
>
> Just a rough idea, so any feedback is appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)

