spark-dev mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Date Mon, 07 Oct 2019 22:19:45 GMT
May I ask what the condition is for an API to be considered public? The
Source/Sink traits are not marked as @DeveloperApi, but they are defined as
public and located in sql-core, so they are not even semantically private
(as catalyst is). That easily gives the signal that they are public APIs.

Also, unless I'm missing something, creating a streaming DataFrame from an
RDD[Row] is not possible even with private APIs. There are other approaches
that use private APIs: 1) SQLContext.internalCreateDataFrame - as it
requires RDD[InternalRow], implementers also have to depend on catalyst and
deal with InternalRow, which the Spark community seems to want to change
eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in
catalyst. So they not only need to apply the "package hack" but also need
to depend on catalyst.
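
To make the two private-API routes above concrete, here is a minimal sketch,
assuming Spark 2.4.4 on the classpath. The package name
org.apache.spark.sql.hypothetical, the object StreamingDataFrames and both
method names are made up for illustration; only internalCreateDataFrame,
Dataset.ofRows and LogicalRDD come from Spark itself, and since they are
internal they may change between releases.

```scala
// "Package hack": the file must sit under org.apache.spark.sql so that
// private[sql] members are visible. The sub-package name is hypothetical.
package org.apache.spark.sql.hypothetical

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark.
object StreamingDataFrames {

  // Route 1: SQLContext.internalCreateDataFrame (private[sql]).
  // It takes RDD[InternalRow], so the caller depends on catalyst.
  def fromInternalRows(
      sqlContext: SQLContext,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame =
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)

  // Route 2: Dataset.ofRows (private[sql]) with a LogicalPlan that is
  // flagged as streaming. LogicalRDD is an internal class as well.
  def fromLogicalPlan(
      spark: SparkSession,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame = {
    val plan = LogicalRDD(schema.toAttributes, rows, isStreaming = true)(spark)
    Dataset.ofRows(spark, plan)
  }
}
```

A custom Source could call either helper from getBatch so that the returned
DataFrame has isStreaming=true. Both routes need the rows as InternalRow,
which is the catalyst dependency mentioned above.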


On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> AFAIK there is no public streaming data source API before DS v2. The
> Source and Sink API is private and is only for builtin streaming sources.
> Advanced users can still implement custom stream sources with private Spark
> APIs (you can put your classes under the org.apache.spark.sql package to
> access the private methods).
>
> That said, DS v2 is the first public streaming data source API. It's
> really hard to design a stable, efficient and flexible data source API that
> is unified between batch and streaming. DS v2 has evolved a lot in the
> master branch and hopefully there will be no big breaking changes anymore.
>
>
> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
> wrote:
>
>> I remembered an actual case from a developer who implements a custom data
>> source.
>>
>>
>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>
>> Quoting here:
>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>> that the DSv2 in 3.0 was a complete breaking change (to the point where it
>> could have been named DSv3 and it wouldn’t have come as a surprise). Since
>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
>> to fall back into DSv1 in order to ease the future transition to Spark 3.
>>
>> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic way
>> to deal with the DSv2 breaking change is to use DSv1 as a temporary
>> solution, even once DSv2 for 3.x is available. They need some time to make
>> the transition.
>>
>> I would file an issue to support streaming data source on DSv1 and submit
>> a patch unless someone objects.
>>
>>
>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <jacek@japila.pl> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> Thanks a lot for your very prompt response!
>>>
>>> > Looks like it's missing, or intended to force custom streaming source
>>> implemented as DSv2.
>>>
>>> That's exactly my understanding = no more DSv1 data sources. That
>>> however is not consistent with the official message, is it? Spark 2.4.4
>>> does not actually say "we're abandoning DSv1", and people may not really
>>> want to jump on DSv2 since it's not recommended (unless I missed that).
>>>
>>> I love surprises (as that's where people pay more for consulting :)),
>>> but not necessarily before public talks (with one at SparkAISummit in two
>>> weeks!) Gonna be challenging! Hope I won't spread a wrong word.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> https://about.me/JacekLaskowski
>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>> The Internals of Spark Structured Streaming
>>> https://bit.ly/spark-structured-streaming
>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> Looks like it's missing, or intended to force custom streaming source
>>>> implemented as DSv2.
>>>>
>>>> I'm not sure the Spark community wants to expand the DSv1 API: I could
>>>> propose the change if we get some support here.
>>>>
>>>> To Spark community: given we are bringing major changes to DSv2, some
>>>> would want to rely on DSv1 while the transition from old DSv2 to new DSv2
>>>> happens and new DSv2 gets stabilized. Would we like to provide the
>>>> necessary changes on DSv1?
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <jacek@japila.pl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think I've got stuck and without your help I won't move any further.
>>>>> Please help.
>>>>>
>>>>> I'm with Spark 2.4.4 and am developing a streaming Source (DSv1,
>>>>> MicroBatch) and in getBatch phase when requested for a DataFrame, there
>>>>> is this assert [1] I can't seem to go past with any DataFrame I managed
>>>>> to create as it's not streaming.
>>>>>
>>>>>           assert(batch.isStreaming,
>>>>>             s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>>>               s"${batch.queryExecution.logical}")
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>>>
>>>>> All I could find is private[sql],
>>>>> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2]
>>>>> or [3]
>>>>>
>>>>> [2]
>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>>>> [3]
>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>>>
>>>>> Pozdrawiam,
>>>>> Jacek Laskowski
>>>>> ----
>>>>> https://about.me/JacekLaskowski
>>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>>> The Internals of Spark Structured Streaming
>>>>> https://bit.ly/spark-structured-streaming
>>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>>
>>>>>
