spark-dev mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Date Thu, 10 Oct 2019 11:44:44 GMT
Hi,

Thanks very much for such a thorough conversation. I enjoyed it a lot.

> Source/Sink traits are in org.apache.spark.sql.execution and thus they
> are private.

That would explain why I couldn't find scaladocs.

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Wed, Oct 9, 2019 at 7:46 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> > Would you mind if I ask the condition of being public API?
>
> The module doesn't matter, but the package matters. We have many public
> APIs in the catalyst module as well. (e.g. DataType)
>
> There are 3 packages in Spark SQL that are meant to be private:
> 1. org.apache.spark.sql.catalyst
> 2. org.apache.spark.sql.execution
> 3. org.apache.spark.sql.internal
>
> You can check out the full list of private packages of Spark in
> project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages
>
> Basically, classes/interfaces that don't appear in the official Spark API
> doc are private.
>
> Source/Sink traits are in org.apache.spark.sql.execution and thus they are
> private.
>
> On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim <kabhwan.opensource@gmail.com>
> wrote:
>
>> Could I ask what the criteria are for an API to count as public? The
>> Source/Sink traits are not marked as @DeveloperApi, but they are defined as
>> public and live in sql-core, so they are not even semantically private (the
>> way catalyst is). That makes it easy to read them as public APIs.
>>
>> Also, unless I'm missing something, creating a streaming DataFrame from an
>> RDD[Row] is not possible even with the private API. There are some other
>> approaches that use private APIs:
>> 1) SQLContext.internalCreateDataFrame: it requires RDD[InternalRow], so
>> implementers also have to depend on catalyst and deal with InternalRow,
>> which the Spark community seems to want to change eventually.
>> 2) Dataset.ofRows: it requires a LogicalPlan, which is also in catalyst.
>> So implementers not only need to apply the "package hack" but also need to
>> depend on catalyst.
>>
>>
>> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> AFAIK there is no public streaming data source API before DS v2. The
>>> Source and Sink API is private and is only for builtin streaming sources.
>>> Advanced users can still implement custom stream sources with private Spark
>>> APIs (you can put your classes under the org.apache.spark.sql package to
>>> access the private methods).
>>>
>>> That said, DS v2 is the first public streaming data source API. It's
>>> really hard to design a stable, efficient and flexible data source API that
>>> is unified between batch and streaming. DS v2 has evolved a lot in the
>>> master branch and hopefully there will be no big breaking changes anymore.
>>>
>>>
>>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> I remembered an actual case from a developer who implements a custom data
>>>> source.
>>>>
>>>>
>>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>>>
>>>> Quoting here:
>>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>>>> that the DSv2 in 3.0 was a complete breaking change (to the point where it
>>>> could have been named DSv3 and it wouldn’t have come as a surprise). Since
>>>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
>>>> to fall back into DSv1 in order to ease the future transition to Spark 3.
>>>>
>>>> Given that DSv2 in Spark 2.x and DSv2 in 3.x have diverged a lot, a
>>>> realistic way to deal with the DSv2 breaking change is to use DSv1 as a
>>>> temporary solution, even once DSv2 for 3.x is available. Developers need
>>>> some time to make the transition.
>>>>
>>>> I will file an issue to support streaming data sources on DSv1 and
>>>> submit a patch unless someone objects.
>>>>
>>>>
>>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <jacek@japila.pl> wrote:
>>>>
>>>>> Hi Jungtaek,
>>>>>
>>>>> Thanks a lot for your very prompt response!
>>>>>
>>>>> > Looks like it's missing, or intended to force custom streaming
>>>>> > source implemented as DSv2.
>>>>>
>>>>> That's exactly my understanding: no more DSv1 data sources. That,
>>>>> however, is not consistent with the official message, is it? Spark 2.4.4
>>>>> does not actually say "we're abandoning DSv1", and people may not really
>>>>> want to jump to DSv2 since it's not recommended (unless I missed that).
>>>>>
>>>>> I love surprises (as that's where people pay more for consulting :)),
>>>>> but not necessarily before public talks (with one at SparkAISummit in two
>>>>> weeks!) Gonna be challenging! Hope I won't spread a wrong word.
>>>>>
>>>>> Regards,
>>>>> Jacek Laskowski
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Looks like it's missing, or intended to force custom streaming source
>>>>>> implemented as DSv2.
>>>>>>
>>>>>> I'm not sure the Spark community wants to expand the DSv1 API; I could
>>>>>> propose the change if we get some support here.
>>>>>>
>>>>>> To the Spark community: given that we are bringing major changes to DSv2,
>>>>>> some people will want to rely on DSv1 while the transition from the old
>>>>>> DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like
>>>>>> to provide the necessary changes to DSv1?
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <jacek@japila.pl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think I'm stuck, and without your help I won't get any further.
>>>>>>> Please help.
>>>>>>>
>>>>>>> I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
>>>>>>> MicroBatch). In the getBatch phase, when the source is asked for a
>>>>>>> DataFrame, there is this assert [1] that I can't seem to get past with
>>>>>>> any DataFrame I have managed to create, as none of them is streaming.
>>>>>>>
>>>>>>>           assert(batch.isStreaming,
>>>>>>>             s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>>>>>               s"${batch.queryExecution.logical}")
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>>>>>
>>>>>>> All I could find is private[sql],
>>>>>>> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3]
>>>>>>>
>>>>>>> [2]
>>>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>>>>>> [3]
>>>>>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jacek Laskowski
>>>>>>>
>>>>>>>
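
For reference, the "package hack" discussed in this thread could look roughly like the sketch below. It is a minimal sketch under assumptions, not a supported approach: it assumes Spark 2.4.x with spark-sql and spark-catalyst on the classpath, and the package name org.apache.spark.sql.demo and the helper StreamingDataFrames are hypothetical names that are not part of Spark. It relies on SQLContext.internalCreateDataFrame, the private[sql] method cited in [2], so it may break between Spark versions.

```scala
// Assumption: compiled against Spark 2.4.x (spark-sql + spark-catalyst).
// Placing the code in a sub-package of org.apache.spark.sql is the
// "package hack": Scala's private[sql] members are visible from here.
package org.apache.spark.sql.demo // hypothetical package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark.
object StreamingDataFrames {

  /** Builds a DataFrame whose logical plan is marked as streaming.
    *
    * The rows must already be catalyst InternalRow values that match
    * `schema`; no conversion from Row is done here.
    */
  def fromInternalRows(
      sqlContext: SQLContext,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame =
    // private[sql] in Spark 2.4.x, reachable only because of the package above.
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
}
```

A custom Source could call such a helper from getBatch so that the returned DataFrame passes the isStreaming assert [1]. The rows have to be InternalRow values that match the schema, for example InternalRow(1L) for a single LongType column, which is the catalyst dependency mentioned earlier in the thread.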
