spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Etienne Chauchot <echauc...@apache.org>
Subject Re: What's the root cause of not supporting multiple aggregations in structured streaming?
Date Thu, 26 Nov 2020 09:29:44 GMT
Hi,

Regarding this subject I wrote a blog article that gives details about 
the watermark architecture proposal that was discussed in the design doc 
and in the PR:

https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html

Best

Etienne

On 29/09/2020 03:24, Yuanjian Li wrote:
> Thanks for the great discussion!
>
> Also interested in this feature and did some investigation before. As 
> Arun mentioned, similar to the "update" mode, the "complete" mode also 
> needs more design. We might need an operation level output mode for 
> the complete mode support. That is to say, if we use "complete" mode 
> for every aggregation operators, the wrong result will return.
>
> SPARK-26655 would be a good start, which only considers about "append" 
> mode. Maybe we need more discussion on the watermark interface. I will 
> take a close look at the doc and PR. Hope we will have the first 
> version with limitations and fix/remove them gradually.
>
> Best,
> Yuanjian
>
> Jungtaek Lim <kabhwan.opensource@gmail.com 
> <mailto:kabhwan.opensource@gmail.com>> 于2020年9月26日周六 上午10:31写道:
>
>     Thanks Etienne! Yeah I forgot to say nice talking with you again.
>     And sorry I forgot to send the reply (was in draft).
>
>     Regarding investment in SS, well, unfortunately I don't know - I'm
>     just an individual. There might be various reasons to do so, most
>     probably "priority" among the stuff. There's not much I could change.
>
>     I agree the workaround is sub-optimal, but unless I see sufficient
>     support in the community I probably couldn't make it go forward.
>     I'll just say there's an elephant in the room - as the project
>     goes forward for more than 10 years, backward compatibility is a
>     top priority concern in the project, even across the major
>     versions along the features/APIs. It is great for end users to
>     migrate the version easily, but also blocks devs to fix the bad
>     design once it ships. I'm the one complaining about these issues
>     in the dev list, and I don't see willingness to correct them.
>
>
>     On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot
>     <echauchot@apache.org <mailto:echauchot@apache.org>> wrote:
>
>         Hi Jungtaek Lim,
>
>         Nice to hear from you again since last time we talked :) and
>         congrats on becoming a Spark committer in the meantime ! (if
>         I'm not mistaking you were not at the time)
>
>         I totally agree with what you're saying on merging structural
>         parts of Spark without having a broader consensus. What I
>         don't understand is why there is not more investment in SS.
>         Especially because in another thread the community is
>         discussing about deprecating the regular DStream streaming
>         framework.
>
>         Is the orientation of Spark now mostly batch ?
>
>         PS: yeah I saw your update on the doc when I took a look at
>         3.0 preview 2 searching for this particular feature. And
>         regarding the workaround, I'm not sure it meets my needs as it
>         will add delays and also may mess up with watermarks.
>
>         Best
>
>         Etienne Chauchot
>
>
>         On 04/09/2020 08:06, Jungtaek Lim wrote:
>>         Unfortunately I don't see enough active committers working on
>>         Structured Streaming; I don't expect major
>>         features/improvements can be brought in this situation.
>>
>>         Technically I can review and merge the PR on major
>>         improvements in SS, but that depends on how huge the proposal
>>         is changing. If the proposal brings conceptual change, being
>>         reviewed by a committer wouldn't still be enough.
>>
>>         So that's not due to the fact we think it's worthless. (That
>>         might be only me though.) I'd understand as there's not much
>>         investment on SS. There's also a known workaround for
>>         multiple aggregations (I've documented in the SS guide doc,
>>         in "Limitation of global watermark" section), though I
>>         totally agree the workaround is bad.
>>
>>         On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot
>>         <echauchot@apache.org <mailto:echauchot@apache.org>> wrote:
>>
>>             Hi all,
>>
>>             I'm also very interested in this feature but the PR is
>>             open since January 2019 and was not updated. It raised a
>>             design discussion around watermarks and a design doc was
>>             written
>>             (https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>             We also commented this design but no matter what it seems
>>             that the subject is still stale.
>>
>>             Is there any interest in the community in delivering this
>>             feature or is it considered worthless ? If the latter,
>>             can you explain why ?
>>
>>             Best
>>
>>             Etienne
>>
>>             On 22/05/2019 03:38, 张万新 wrote:
>>>             Thanks, I'll check it out.
>>>
>>>             Arun Mahadevan <arunm@apache.org
>>>             <mailto:arunm@apache.org>> 于 2019年5月21日周二 01:31写道:
>>>
>>>                 Heres the proposal for supporting it in "append"
>>>                 mode - https://github.com/apache/spark/pull/23576.
>>>                 You could see if it addresses your requirement and
>>>                 post your feedback in the PR.
>>>                 For "update" mode its going to be much harder to
>>>                 support this without first adding support for
>>>                 "retractions", otherwise we would end up with wrong
>>>                 results.
>>>
>>>                 - Arun
>>>
>>>
>>>                 On Mon, 20 May 2019 at 01:34, Gabor Somogyi
>>>                 <gabor.g.somogyi@gmail.com
>>>                 <mailto:gabor.g.somogyi@gmail.com>> wrote:
>>>
>>>                     There is PR for this but not yet merged.
>>>
>>>                     On Mon, May 20, 2019 at 10:13 AM 张万新
>>>                     <kevinzwx1992@gmail.com
>>>                     <mailto:kevinzwx1992@gmail.com>> wrote:
>>>
>>>                         Hi there,
>>>
>>>                         I'd like to know what's the root reason why
>>>                         multiple aggregations on streaming dataframe
>>>                         is not allowed since it's a very useful
>>>                         feature, and flink has supported it for a
>>>                         long time.
>>>
>>>                         Thanks.
>>>

Mime
View raw message