Yep, that's clear. That's a reasonable case. There are already
approximate median computations that can be done cumulatively as you
say, implemented in Spark. I think it's reasonable to consider this
for performance, as it can be faster with just a small error
tolerance. But yeah up to you if you have better ideas.
On Wed, Nov 27, 2019 at 7:57 PM Jungtaek Lim
<kabhwan.opensource@gmail.com> wrote:
>
> Thanks all for providing inputs! Maybe I wasn't clear about my intention.
>
> The issue I focus on is; there're plenty of metrics being defined in a stage for SQL,
and each metric has values for each task and being grouped later to calculate aggregated values.
(e.g. metric for "elapsed time" is shown in UI as sum, min, med, max  which source values
come from each task)
>
> Due to the nature of exact calculation of "median", we can't apply accumulation  we
are now storing all values for all metrics till the end of stage. Given the default value
of sql shuffle partition is 200, a stage would have 200 tasks when we deal with shuffle (grouping,
join, etc.). If we have 50 metrics in a stage, 10000 Long values are maintained in driver
side which may ideally just need to be 50 * number of aggregation (at most 4) if all of aggregations
support accumulation. So I'm wondering something which could support accumulation and closer
to median. (I guess it's intentional to not take average here so...)
>
> What's more on SQLAppStatusListener, they're calculated altogether at the end of SQL
execution, which may contain multiple jobs.
> (Oh wait... Hmm... Looks like I missed the another point of optimization here which might
mitigate the issue heavily... so please treat my idea as rough idea just for possible optimization.)
>
> But again that's very rough idea, and it won't make sense if the expected output is not
acceptable as representation.
>
> Jungtaek Lim (HeartSaVioR)
>
>
> On Wed, Nov 27, 2019 at 11:25 PM Sean Owen <srowen@gmail.com> wrote:
>>
>> How big is the overhead, at scale?
>> If it has a nontrivial effect for most jobs, I could imagine reusing
>> the existing approximate quantile support to more efficiently find a
>> prettyclose median.
>>
>> On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim
>> <kabhwan.opensource@gmail.com> wrote:
>> >
>> > Hi Spark devs,
>> >
>> > The change might be specific to the SQLAppStatusListener, but given it may change
the value of metric being shown in UI, so would like to hear some voices on this.
>> >
>> > When we aggregate the SQL metric between tasks, we apply "sum", "min", "median",
"max", which all are cumulative except "median". That's different from "average" given it
helps to get rid of outliers, but if that's the only purpose, it may not strictly need to
have exact value of median.
>> >
>> > I'm not sure how much the value is losing the meaning of representation, but
if it doesn't hurt much, what about taking median of medians? For example, taking median of
nearest 10 tasks and store it as one of median values, and finally taking median of medians.
If I calculate correctly, that would only require 11% of slots if the number of tasks is 100,
and replace sorting 100 elements with sorting 10 elements 11 times. The difference would be
bigger if the number of tasks is bigger.
>> >
>> > Just a rough idea so any feedbacks are appreciated.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)

To unsubscribe email: devunsubscribe@spark.apache.org
