spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuanjian Li <>
Subject Re: What's the root cause of not supporting multiple aggregations in structured streaming?
Date Fri, 27 Nov 2020 03:07:56 GMT
Nice blog! Thanks for sharing, Etienne!

Let's try to raise this discussion again after the 3.1 release. I do think
more committers/contributors had realized the issue of global watermark per
SPARK-24634 <> and
SPARK-33259 <>.

Leaving some thoughts on my end:
1. Compatibility: The per-operation watermark should be compatible with the
original global one when there are no multi-aggregations.
2. Versioning: If we need to change checkpoints' format, versioning info
should be added for the first time.
3. Fix more things together: We'd better fix more issues(e.g. per-operation
output mode for multi-aggregations) together, which would require
versioning changes in the same Spark version.


Etienne Chauchot <> 于2020年11月26日周四 下午5:29写道:

> Hi,
> Regarding this subject I wrote a blog article that gives details about the
> watermark architecture proposal that was discussed in the design doc and in
> the PR:
> Best
> Etienne
> On 29/09/2020 03:24, Yuanjian Li wrote:
> Thanks for the great discussion!
> Also interested in this feature and did some investigation before. As Arun
> mentioned, similar to the "update" mode, the "complete" mode also needs
> more design. We might need an operation level output mode for the complete
> mode support. That is to say, if we use "complete" mode for every
> aggregation operators, the wrong result will return.
> SPARK-26655 would be a good start, which only considers about "append"
> mode. Maybe we need more discussion on the watermark interface. I will take
> a close look at the doc and PR. Hope we will have the first version with
> limitations and fix/remove them gradually.
> Best,
> Yuanjian
> Jungtaek Lim <> 于2020年9月26日周六 上午10:31写道:
>> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
>> sorry I forgot to send the reply (was in draft).
>> Regarding investment in SS, well, unfortunately I don't know - I'm just
>> an individual. There might be various reasons to do so, most probably
>> "priority" among the stuff. There's not much I could change.
>> I agree the workaround is sub-optimal, but unless I see sufficient
>> support in the community I probably couldn't make it go forward. I'll just
>> say there's an elephant in the room - as the project goes forward for more
>> than 10 years, backward compatibility is a top priority concern in the
>> project, even across the major versions along the features/APIs. It is
>> great for end users to migrate the version easily, but also blocks devs to
>> fix the bad design once it ships. I'm the one complaining about these
>> issues in the dev list, and I don't see willingness to correct them.
>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <>
>> wrote:
>>> Hi Jungtaek Lim,
>>> Nice to hear from you again since last time we talked :) and congrats on
>>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
>>> not at the time)
>>> I totally agree with what you're saying on merging structural parts of
>>> Spark without having a broader consensus. What I don't understand is why
>>> there is not more investment in SS. Especially because in another thread
>>> the community is discussing about deprecating the regular DStream streaming
>>> framework.
>>> Is the orientation of Spark now mostly batch ?
>>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview
>>> 2 searching for this particular feature. And regarding the workaround, I'm
>>> not sure it meets my needs as it will add delays and also may mess up with
>>> watermarks.
>>> Best
>>> Etienne Chauchot
>>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>> Unfortunately I don't see enough active committers working on Structured
>>> Streaming; I don't expect major features/improvements can be brought in
>>> this situation.
>>> Technically I can review and merge the PR on major improvements in SS,
>>> but that depends on how huge the proposal is changing. If the proposal
>>> brings conceptual change, being reviewed by a committer wouldn't still be
>>> enough.
>>> So that's not due to the fact we think it's worthless. (That might be
>>> only me though.) I'd understand as there's not much investment on SS.
>>> There's also a known workaround for multiple aggregations (I've documented
>>> in the SS guide doc, in "Limitation of global watermark" section), though I
>>> totally agree the workaround is bad.
>>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <>
>>> wrote:
>>>> Hi all,
>>>> I'm also very interested in this feature but the PR is open since
>>>> January 2019 and was not updated. It raised a design discussion around
>>>> watermarks and a design doc was written (
>>>> We also commented this design but no matter what it seems that the subject
>>>> is still stale.
>>>> Is there any interest in the community in delivering this feature or is
>>>> it considered worthless ? If the latter, can you explain why ?
>>>> Best
>>>> Etienne
>>>> On 22/05/2019 03:38, 张万新 wrote:
>>>> Thanks, I'll check it out.
>>>> Arun Mahadevan <> 于 2019年5月21日周二 01:31写道:
>>>>> Heres the proposal for supporting it in "append" mode -
>>>>> You could see if it
>>>>> addresses your requirement and post your feedback in the PR.
>>>>> For "update" mode its going to be much harder to support this without
>>>>> first adding support for "retractions", otherwise we would end up with
>>>>> wrong results.
>>>>> - Arun
>>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <>
>>>>> wrote:
>>>>>> There is PR for this but not yet merged.
>>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <>
>>>>>>> Hi there,
>>>>>>> I'd like to know what's the root reason why multiple aggregations
>>>>>>> streaming dataframe is not allowed since it's a very useful feature,
>>>>>>> flink has supported it for a long time.
>>>>>>> Thanks.

View raw message