spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Etienne Chauchot <>
Subject Re: What's the root cause of not supporting multiple aggregations in structured streaming?
Date Thu, 26 Nov 2020 09:29:44 GMT

Regarding this subject I wrote a blog article that gives details about 
the watermark architecture proposal that was discussed in the design doc 
and in the PR:



On 29/09/2020 03:24, Yuanjian Li wrote:
> Thanks for the great discussion!
> Also interested in this feature and did some investigation before. As 
> Arun mentioned, similar to the "update" mode, the "complete" mode also 
> needs more design. We might need an operation level output mode for 
> the complete mode support. That is to say, if we use "complete" mode 
> for every aggregation operators, the wrong result will return.
> SPARK-26655 would be a good start, which only considers about "append" 
> mode. Maybe we need more discussion on the watermark interface. I will 
> take a close look at the doc and PR. Hope we will have the first 
> version with limitations and fix/remove them gradually.
> Best,
> Yuanjian
> Jungtaek Lim < 
> <>> 于2020年9月26日周六 上午10:31写道:
>     Thanks Etienne! Yeah I forgot to say nice talking with you again.
>     And sorry I forgot to send the reply (was in draft).
>     Regarding investment in SS, well, unfortunately I don't know - I'm
>     just an individual. There might be various reasons to do so, most
>     probably "priority" among the stuff. There's not much I could change.
>     I agree the workaround is sub-optimal, but unless I see sufficient
>     support in the community I probably couldn't make it go forward.
>     I'll just say there's an elephant in the room - as the project
>     goes forward for more than 10 years, backward compatibility is a
>     top priority concern in the project, even across the major
>     versions along the features/APIs. It is great for end users to
>     migrate the version easily, but also blocks devs to fix the bad
>     design once it ships. I'm the one complaining about these issues
>     in the dev list, and I don't see willingness to correct them.
>     On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot
>     < <>> wrote:
>         Hi Jungtaek Lim,
>         Nice to hear from you again since last time we talked :) and
>         congrats on becoming a Spark committer in the meantime ! (if
>         I'm not mistaking you were not at the time)
>         I totally agree with what you're saying on merging structural
>         parts of Spark without having a broader consensus. What I
>         don't understand is why there is not more investment in SS.
>         Especially because in another thread the community is
>         discussing about deprecating the regular DStream streaming
>         framework.
>         Is the orientation of Spark now mostly batch ?
>         PS: yeah I saw your update on the doc when I took a look at
>         3.0 preview 2 searching for this particular feature. And
>         regarding the workaround, I'm not sure it meets my needs as it
>         will add delays and also may mess up with watermarks.
>         Best
>         Etienne Chauchot
>         On 04/09/2020 08:06, Jungtaek Lim wrote:
>>         Unfortunately I don't see enough active committers working on
>>         Structured Streaming; I don't expect major
>>         features/improvements can be brought in this situation.
>>         Technically I can review and merge the PR on major
>>         improvements in SS, but that depends on how huge the proposal
>>         is changing. If the proposal brings conceptual change, being
>>         reviewed by a committer wouldn't still be enough.
>>         So that's not due to the fact we think it's worthless. (That
>>         might be only me though.) I'd understand as there's not much
>>         investment on SS. There's also a known workaround for
>>         multiple aggregations (I've documented in the SS guide doc,
>>         in "Limitation of global watermark" section), though I
>>         totally agree the workaround is bad.
>>         On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot
>>         < <>> wrote:
>>             Hi all,
>>             I'm also very interested in this feature but the PR is
>>             open since January 2019 and was not updated. It raised a
>>             design discussion around watermarks and a design doc was
>>             written
>>             (
>>             We also commented this design but no matter what it seems
>>             that the subject is still stale.
>>             Is there any interest in the community in delivering this
>>             feature or is it considered worthless ? If the latter,
>>             can you explain why ?
>>             Best
>>             Etienne
>>             On 22/05/2019 03:38, 张万新 wrote:
>>>             Thanks, I'll check it out.
>>>             Arun Mahadevan <
>>>             <>> 于 2019年5月21日周二 01:31写道:
>>>                 Heres the proposal for supporting it in "append"
>>>                 mode -
>>>                 You could see if it addresses your requirement and
>>>                 post your feedback in the PR.
>>>                 For "update" mode its going to be much harder to
>>>                 support this without first adding support for
>>>                 "retractions", otherwise we would end up with wrong
>>>                 results.
>>>                 - Arun
>>>                 On Mon, 20 May 2019 at 01:34, Gabor Somogyi
>>>                 <
>>>                 <>> wrote:
>>>                     There is PR for this but not yet merged.
>>>                     On Mon, May 20, 2019 at 10:13 AM 张万新
>>>                     <
>>>                     <>> wrote:
>>>                         Hi there,
>>>                         I'd like to know what's the root reason why
>>>                         multiple aggregations on streaming dataframe
>>>                         is not allowed since it's a very useful
>>>                         feature, and flink has supported it for a
>>>                         long time.
>>>                         Thanks.

View raw message