spark-dev mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Date Wed, 02 Oct 2019 07:08:04 GMT
Hi Jungtaek,

Thanks a lot for your very prompt response!

> Looks like it's missing, or intended to force custom streaming source
> implemented as DSv2.

That's exactly my understanding: no more DSv1 data sources. That, however,
is not consistent with the official message, is it? Spark 2.4.4 does not
actually say "we're abandoning DSv1", and people can't really be expected to
jump to DSv2 since it's not yet recommended (unless I missed that).

I love surprises (that's where people pay more for consulting :)), but
not necessarily right before public talks (with one at SparkAISummit in two
weeks!). It's going to be challenging! I hope I won't spread the wrong word.
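
For completeness, the only way I've found to get past that assert from DSv1 code is the package-private internalCreateDataFrame mentioned below. A minimal sketch (the helper object and its name are mine, not any public API; this piggybacks on the org.apache.spark.sql package to reach private[sql] members, relies on Spark internals, and could break in any release):

```scala
// Hypothetical helper, NOT a public API: placed in org.apache.spark.sql
// solely to gain access to the private[sql] internalCreateDataFrame.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

object StreamingDataFrameHack {

  // Builds a DataFrame with isStreaming = true, which is exactly what
  // the assert in MicroBatchExecution.getBatch checks for.
  def createStreamingDataFrame(
      sqlContext: SQLContext,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame =
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
}
```

A DSv1 Source's getBatch could then return StreamingDataFrameHack.createStreamingDataFrame(sqlContext, rowsRDD, schema) and satisfy the assert, at the cost of depending on an internal, unsupported method.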

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Looks like it's missing, or intended to force custom streaming source
> implemented as DSv2.
>
> I'm not sure the Spark community wants to expand the DSv1 API: I could propose
> the change if we get some support here.
>
> To the Spark community: given we are bringing major changes to DSv2, someone
> might want to rely on DSv1 while the transition from the old DSv2 to the new
> DSv2 happens and the new DSv2 stabilizes. Would we like to provide the
> necessary changes to DSv1?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <jacek@japila.pl> wrote:
>
>> Hi,
>>
>> I think I'm stuck, and without your help I won't get any further.
>> Please help.
>>
>> I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
>> MicroBatch). In the getBatch phase, when requested for a DataFrame, there is
>> this assert [1] that I can't seem to get past with any DataFrame I've managed
>> to create, as none of them are streaming.
>>
>>           assert(batch.isStreaming,
>>             s"DataFrame returned by getBatch from $source did not have
>> isStreaming=true\n" +
>>               s"${batch.queryExecution.logical}")
>>
>> [1]
>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>
>> All I could find is private[sql] API,
>> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].
>>
>> [2]
>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>> [3]
>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>> The Internals of Spark Structured Streaming
>> https://bit.ly/spark-structured-streaming
>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
