spark-dev mailing list archives

From James Baker
Subject Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2
Date Tue, 29 Aug 2017 02:54:29 GMT
Copying from the code review comments I just submitted on the draft API (

Context here is that I've spent some time implementing a Spark datasource and have had some
issues with the current API which are made worse in V2.

The general conclusion I’ve come to is that this API is very hard to implement correctly.
The difficulty is of the same kind as in DataSource V1, but worse, because of the extra
methods and dimensions we get in V2.

In DataSource V1’s PrunedFilteredScan, the issue is that the filters are passed to you in
the buildScan method, and then passed to you again in the unhandledFilters method.

However, the filters that you can’t handle might be data dependent, which the current API
does not handle well. Suppose I can handle filter A some of the time, and filter B some of
the time. If I’m passed both, then the unhandled set might be both A and B, just A, just B,
or neither. The work I have to do to figure this out is essentially the same work I have to
do to actually generate my RDD (essentially I have to generate my partitions), so I end up
doing some weird caching work.
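
To make the caching problem concrete, here is a minimal sketch of the V1 shape. It is written in Python for brevity rather than against the real Scala API; ToyRelation, its method names, and the rule that only A is applied when A and B arrive together are all assumptions invented for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


class ToyRelation:
    """Hypothetical stand-in for a V1-style relation (not a Spark class).

    Both unhandled_filters and build_scan receive the same filters, and both
    need the result of the same expensive, data-dependent planning step.
    """

    def __init__(self) -> None:
        self.planning_calls = 0  # how many times the expensive step really ran
        self._cache: Dict[FrozenSet[str], Tuple[List[str], List[str]]] = {}

    def _plan(self, filters: List[str]) -> Tuple[List[str], List[str]]:
        """Build partitions and work out which filters could not be applied."""
        key = frozenset(filters)
        if key not in self._cache:
            self.planning_calls += 1
            # Assumed rule: A or B can each be handled alone, but when both
            # are present only the first one (A) can be applied.
            handled = [f for f in ("A", "B") if f in key][:1]
            partitions = [f"partition-{i}[{','.join(handled)}]" for i in range(2)]
            unhandled = [f for f in filters if f not in handled]
            self._cache[key] = (partitions, unhandled)
        return self._cache[key]

    def unhandled_filters(self, filters: List[str]) -> List[str]:
        return self._plan(filters)[1]

    def build_scan(self, filters: List[str]) -> List[str]:
        return self._plan(filters)[0]
```

Without the cache, the planning work would run once for unhandled_filters and again for build_scan; with it, the datasource has to carry state between two calls that the API treats as unrelated.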

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there
is essentially one degree of freedom for pruning (filters), so you just have to implement
caching between unhandledFilters and buildScan. However, here we have many degrees of freedom:
sorts, individual filters, clustering, sampling, maybe aggregations eventually. These
operations are not all commutative, and computing my support one-by-one can easily end up
being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change answers I have
already given, because a later pushdown might invalidate them.

What would really help here is to be passed all of the filters, sorts, etc. at once, and
then I return the parts I can’t handle.
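
As a rough sketch of what that all-at-once shape could look like (Python for brevity, not a proposal for concrete Spark signatures): Pushdown and negotiate are hypothetical names, and the toy rule that the source is only sorted by "ts" once a "ts" filter has been applied is an assumption chosen to mirror the first example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Pushdown:
    """Everything the engine would like pushed down, in one value (hypothetical)."""
    filters: Tuple[str, ...] = ()
    sort: Tuple[str, ...] = ()


def negotiate(requested: Pushdown) -> Tuple[Pushdown, Pushdown]:
    """Return (accepted, residual) for a toy source indexed on column "ts".

    Assumed behaviour: filters on "ts" can be handled; output is sorted by
    "ts" only when at least one "ts" filter was applied.
    """
    accepted_filters = tuple(f for f in requested.filters if f.startswith("ts"))
    residual_filters = tuple(f for f in requested.filters if not f.startswith("ts"))
    can_sort = bool(accepted_filters) and requested.sort == ("ts",)
    accepted = Pushdown(accepted_filters, requested.sort if can_sort else ())
    residual = Pushdown(residual_filters, () if can_sort else requested.sort)
    return accepted, residual
```

Because the source sees the filters and the sort together, it can answer in a single call and never has to revise an earlier answer.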

I’d prefer in general that this be implemented by passing some kind of query plan to the
datasource which enables this kind of replacement. I explicitly don’t want to hand over the
whole query plan (that sounds painful); I’d prefer we push down only the parts of the query
plan we deem to be stable. With the mix-in approach, I don’t think we can guarantee the properties
we want without a two-phase thing - I’d really love to be able to just define a straightforward
union type which is our supported pushdown stuff, and then the user can transform and return

I think this ends up being a more elegant API for consumers, and also far more intuitive.
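
A minimal sketch of the union-type idea, again in Python rather than Scala and with every name (Filter, Sort, Sample, PushdownOp, push_down) made up for illustration. The toy source absorbs a leading run of operations it supports and hands back the rest untouched; stopping at the first unsupported operation is an assumption that keeps the sketch safe when operations do not commute.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Filter:
    predicate: str


@dataclass(frozen=True)
class Sort:
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Sample:
    fraction: float


# The "straightforward union type" of operations the engine agrees to offer.
PushdownOp = Union[Filter, Sort, Sample]


def push_down(ops: List[PushdownOp]) -> Tuple[List[PushdownOp], List[PushdownOp]]:
    """Return (absorbed, remaining) for a toy source indexed on column "ts".

    Operations are given in plan order. The source takes a prefix it can
    handle and returns everything from the first unsupported op onwards.
    """
    absorbed: List[PushdownOp] = []
    for i, op in enumerate(ops):
        if isinstance(op, Filter) and op.predicate.startswith("ts"):
            absorbed.append(op)
        elif (
            isinstance(op, Sort)
            and op.columns == ("ts",)
            and any(isinstance(a, Filter) for a in absorbed)
        ):
            # Assumed rule: sorted output is only available after a "ts" filter.
            absorbed.append(op)
        else:
            return absorbed, list(ops[i:])
    return absorbed, []
```

The engine would then execute the remaining operations itself on top of whatever the source produces.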


On Mon, 28 Aug 2017 at 18:00 蒋星博 wrote:
+1 (Non-binding)

Xiao Li, 28 Aug 2017:

2017-08-28 12:45 GMT-07:00 Cody Koeninger:
Just wanted to point out that because the jira isn't labeled SPIP, it
won't have shown up linked from

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan wrote:
> Hi all,
> It has been almost 2 weeks since I proposed the data source V2 for
> discussion, and we have already received some feedback on the JIRA ticket and the
> prototype PR, so I'd like to call for a vote.
> The full document of the Data Source API V2 is:
> Note that this vote should focus on the high-level design/framework, not
> specific APIs, as we can always change/improve specific APIs during
> development.
> The vote will be up for the next 72 hours. Please reply with your vote:
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
> Thanks!
