spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)
Date Thu, 06 Oct 2016 19:37:56 GMT
Fred, I think that's a pretty good summary of my thoughts.  Thanks for
condensing them :)

Right now, my focus is on getting more people using Structured Streaming so
that we can get some real-world feedback on what is missing.  Concretely,
that means:
 - SPARK-15406 <https://issues.apache.org/jira/browse/SPARK-15406> Kafka
Support - since this seems to be the source of choice for many users
 - SPARK-17731 <https://issues.apache.org/jira/browse/SPARK-17731> Metrics
- right now it's pretty hard to see what is going on, and where latency is
coming from.
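
For readers following the Kafka work, once SPARK-15406 lands, consuming a
topic would look roughly like the sketch below.  This is against the
DataStreamReader API; the option names ("kafka.bootstrap.servers",
"subscribe") follow the proposed design and may change before release:

```scala
// Sketch only: assumes the Kafka source from SPARK-15406 is available.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "events")  // topic(s) to consume
  .load()

// The source exposes key/value as binary; cast them for processing.
val events = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```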

Once those are in and have seen some use, I think it'll be easier to
prioritize the work on #1.

Relatedly, I'm curious to hear more about the types of questions you are
getting.  I think the dev list is a good place to discuss applications and
if/how structured streaming can handle them.

On Wed, Oct 5, 2016 at 3:20 PM, Fred Reiss <freiss.oss@gmail.com> wrote:

> Thanks for the thoughtful comments, Michael and Shivaram. From what I’ve
> seen in this thread and on JIRA, it looks like the current plan with regard
> to application-facing APIs for sinks is roughly:
> 1. Rewrite incremental query compilation for Structured Streaming.
> 2. Redesign Structured Streaming's source and sink APIs so that they do
> not depend on RDDs.
> 3. Allow the new APIs to stabilize.
> 4. Open these APIs to use by application code.
>
> Is there a way for those of us who aren’t involved in the first two steps
> to get some idea of the current plans and progress? I get asked a lot about
> when Structured Streaming will be a viable replacement for Spark Streaming,
> and I like to be able to give accurate advice.
>
> Fred
>
> On Tue, Oct 4, 2016 at 3:02 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> I don't quite understand why exposing it indirectly through a typed
>>> interface should be delayed before finalizing the API.
>>>
>>
>> Spark has a long history
>> <https://spark-project.atlassian.net/browse/SPARK-1094> of maintaining
>> binary compatibility in its public APIs.  I strongly believe this is one of
>> the things that has made the project successful.  Exposing internals that
>> we know are going to change in the primary user facing API for creating
>> Streaming DataFrames seems directly counter to this goal.  I think the
>> argument that "you can do it anyway" fails to account for the expectations
>> of users who probably aren't closely following this discussion.
>>
>> If advanced users want to dig through the code and experiment, great.  I
>> hope they report back on what's good and what can be improved.  However, if
>> you add the function suggested in the PR to DataStreamReader, you are
>> giving them a bad experience by leaking internals that don't even show up
>> in the published documentation.
>>
>
>
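For reference, the internal contract this thread is debating looks roughly
like the following in the Spark 2.0 codebase
(org.apache.spark.sql.execution.streaming.Sink and
org.apache.spark.sql.sources.StreamSinkProvider).  This is a sketch from the
current source, not a published API; it is unstable precisely because
addBatch hands the sink an internally constructed DataFrame, and the
signatures are expected to change as the APIs are redesigned:

```scala
// Internal (unstable) sink hooks as of Spark 2.0 -- sketch only.
trait Sink {
  // Called once per trigger with that batch's data.  Implementations
  // should be idempotent on batchId so retries preserve exactly-once.
  def addBatch(batchId: Long, data: DataFrame): Unit
}

trait StreamSinkProvider {
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink
}
```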
