spark-user mailing list archives

From Vladimir Prus <vladimir.p...@gmail.com>
Subject State of datasource api v2
Date Mon, 14 Jan 2019 08:48:16 GMT
Hi,

I am trying to understand the state of datasource v2, and I'm a bit lost.
On one hand, it is supposed to be a more flexible approach, as described for
example here:

    https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang

On the other hand, it appears that both the Parquet and ORC file readers are
still not using the v2 interface. There's an umbrella issue to address that:

    https://issues.apache.org/jira/browse/SPARK-23507

but it does not have any sub-issues that address Parquet, and the issue about
ORC:

    https://issues.apache.org/jira/browse/SPARK-23817

includes this text: "Not supported( due to limitation of data source V2):
(1) Read multiple file path (2) Read bucketed file.".

Is there any up-to-date information on whether datasource v2 will indeed
become the primary datasource API, whether the Parquet reader will be
converted to v2, and whether the limitations above will be fixed?

Thanks in advance,

-- 
Vladimir Prus
http://vladimirprus.com
