spark-dev mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table
Date Mon, 28 Sep 2020 01:10:26 GMT
Bumping this thread to see whether anyone is interested in or concerned
about this.

On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Hi devs,
>
> We have a capability check in DSv2 that defines which operations can be
> performed against a data source, for both reads and writes. The concept
> was introduced in DSv2, so it's not surprising that DSv1 has no such
> concept.
>
> In SS the problem arises - if I understand correctly, we would like to
> couple the output mode of the query with the output table. That is,
> complete mode should require the output table to truncate its content,
> and update mode should require the output table to "upsert" or "delete
> and append" the content.
>
> Nothing has been done for the DSv1 sink - Spark doesn't enforce anything
> and the sink works as in append mode, though the query still respects the
> output mode on stateful operations.
>
> I understand we don't want to surprise end users with broken
> compatibility, but shouldn't that be a "temporary", "exceptional" case
> that DSv2 never repeats? I'm seeing many built-in data sources being
> migrated to DSv2 with the exception of "do nothing for update/truncate",
> which simply defeats the rationale for having capabilities.
>
> In addition, they don't add TRUNCATE to the capabilities but do add
> SupportsTruncate in WriteBuilder, which is weird. It works as of now
> because SS doesn't check capabilities on the writer side (I guess it only
> checks STREAMING_WRITE), but once we check capabilities in the first
> place, things will break.
> (I'm looking into adding a writer plan in SS before the analyzer, and
> checking capabilities there.)
>
> What would be our best fix for this issue? Should we leave the
> responsibility for handling "truncate" to the data source (so doing
> nothing is fine if it's intended), and just add TRUNCATE to the
> capabilities? (That should be documented in the data source's description
> though.) Or should we drop support for truncate if the data source is
> unable to truncate? (Foreach and Kafka output tables would be unable to
> use complete mode afterwards.)
>
> Looking forward to hearing everyone's thoughts.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
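
For readers of the archive, the kind of writer-side check discussed in the
quoted message could look roughly like the sketch below. It is only an
illustration under stated assumptions: the enums and the class are
hypothetical stand-ins rather than Spark's actual classes, and the UPSERT
capability in particular is invented here to represent "upsert" or "delete
and append" support. The mapping from output mode to required capability
follows the coupling described in the message (complete needs truncate,
update needs upsert).

```java
import java.util.EnumSet;
import java.util.Set;

// Standalone sketch; none of these types are Spark's real classes.
public class OutputModeCapabilityCheck {
    // Hypothetical stand-in for the streaming query's output mode.
    public enum OutputMode { APPEND, COMPLETE, UPDATE }

    // Hypothetical stand-in for table capabilities. UPSERT is an assumption
    // made for this sketch only.
    public enum Capability { STREAMING_WRITE, TRUNCATE, UPSERT }

    // Capabilities a sink would need to declare for the given output mode,
    // following the coupling proposed in the message.
    public static Set<Capability> required(OutputMode mode) {
        Set<Capability> needed = EnumSet.of(Capability.STREAMING_WRITE);
        switch (mode) {
            case COMPLETE:
                needed.add(Capability.TRUNCATE);
                break;
            case UPDATE:
                needed.add(Capability.UPSERT);
                break;
            default:
                break; // APPEND needs nothing beyond STREAMING_WRITE
        }
        return needed;
    }

    // Required capabilities that the sink does not declare. A non-empty
    // result is where a strict check would reject the query.
    public static Set<Capability> missing(OutputMode mode, Set<Capability> declared) {
        Set<Capability> result = EnumSet.copyOf(required(mode));
        result.removeAll(declared);
        return result;
    }

    public static void main(String[] args) {
        // A sink that declares only STREAMING_WRITE, like the migrated
        // sources the message describes.
        Set<Capability> sink = EnumSet.of(Capability.STREAMING_WRITE);
        for (OutputMode mode : OutputMode.values()) {
            System.out.println(mode + " missing: " + missing(mode, sink));
        }
    }
}
```

Under this sketch, a sink that declares only STREAMING_WRITE passes for
append mode but fails for complete and update mode, which is the breakage
the message anticipates once capabilities are checked up front.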
