spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
Date Thu, 07 Sep 2017 18:12:45 GMT
+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rblue@netflix.com.invalid> wrote:

> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined with
> other private APIs, vs pushing forward state of the art for performance.
>
> I think that’s the right approach for this SPIP. We can add the support
> you’re talking about later with a more specific plan that doesn’t block
> fixing the problems that this addresses.
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
> hvanhovell@databricks.com> wrote:
>
>> +1 (binding)
>>
>> I personally believe that there is quite a big difference between having
>> a generic data source interface with a low surface area and pushing down a
>> significant part of query processing into a datasource. The latter has a much
>> wider surface area and will require us to stabilize most of the
>> internal catalyst APIs, which will be a significant burden on the community
>> to maintain and has the potential to slow development velocity
>> significantly. If you want to write such integrations then you should be
>> prepared to work with catalyst internals and own up to the fact that things
>> might change across minor versions (and in some cases even maintenance
>> releases). If you are willing to go down that road, then your best bet is
>> to use the already existing spark session extensions which will allow you
>> to write such integrations and can be used as an `escape hatch`.
>>
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <andrew@andrewash.com> wrote:
>>
>>> +0 (non-binding)
>>>
>>> I think there are benefits to unifying all the Spark-internal
>>> datasources into a common public API for sure.  It will serve as a forcing
>>> function to ensure that those internal datasources aren't advantaged vs
>>> datasources developed externally as plugins to Spark, and that all Spark
>>> features are available to all datasources.
>>>
>>> But I also think this read-path proposal avoids the more difficult
>>> questions around how to continue pushing datasource performance forwards.
>>> James Baker (my colleague) had a number of questions about advanced
>>> pushdowns (combined sorting and filtering), and Reynold also noted that
>>> pushdown of aggregates and joins is desirable on longer timeframes as
>>> well.  The Spark community saw similar requests, for aggregate pushdown in
>>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>>> performance work for datasources.
>>>
>>> To leave enough space for datasource developers to continue
>>> experimenting with advanced interactions between Spark and their
>>> datasources, I'd propose we leave some sort of escape valve that enables
>>> these datasources to keep pushing the boundaries without forking Spark.
>>> Possibly that looks like an additional unsupported/unstable interface that
>>> pushes down an entire (unstable API) logical plan, which is expected to
>>> break API on every release.   (Spark attempts this full-plan pushdown, and
>>> if that fails Spark ignores it and continues on with the rest of the V2 API
>>> for compatibility).  Or maybe it looks like something else that we don't
>>> know of yet.  Possibly this falls outside of the desired goals for the V2
>>> API and instead should be a separate SPIP.
>>>
>>> If we had a plan for this kind of escape valve for advanced datasource
>>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>>> focused more on getting the basics right for what many datasources are
>>> already doing in API V1 combined with other private APIs, vs pushing
>>> forward state of the art for performance.
>>>
>>> Andrew
>>>
>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>>> suresh.thalamati@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>>
>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In the previous discussion, we decided to split the read and write path
>>>> of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
>>>> for Data Source V2 read path only.
>>>>
>>>> The full document of the Data Source API V2 is:
>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>
>>>> The ready-for-review PR that implements the basic infrastructure for
>>>> the read path is:
>>>> https://github.com/apache/spark/pull/19136
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhovell@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
