spark-dev mailing list archives

From tigerquoll <>
Subject [Discuss] Datasource v2 support for manipulating partitions
Date Sun, 16 Sep 2018 06:24:28 GMT
I've been following the development of the new data source abstraction with
keen interest.  One of the issues that occurred to me as I sat down and
planned how I would implement a data source is how to support manipulating
partitions.

My reading of the current prototype is that the Data Source V2 APIs expose
enough of a partition concept to communicate record-distribution
particulars to Catalyst, but they do not represent partitions as a concept
that end users of a data source can manipulate.

End users of data sources need to be able to add, drop, modify, and list
partitions. For example, many systems require partitions to be created
before records are added to them.
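
To make the "create before write" requirement concrete: with today's Hive
catalog this kind of pre-creation is possible through Spark SQL DDL, but only
because the Hive catalog happens to support it; an arbitrary Data Source V2
implementation has no equivalent path. The table and partition names below
are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: assumes a Hive-backed table named `events`,
// partitioned by a `dt` column, already exists in the catalog.
val spark = SparkSession.builder()
  .appName("partition-ddl-example")
  .enableHiveSupport()
  .getOrCreate()

// Pre-create the partition so records (e.g. from a streaming sink)
// have somewhere to land before the first write arrives.
spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt = '2018-09-16')")

// Listing partitions is likewise a catalog-level operation today.
spark.sql("SHOW PARTITIONS events").show()
```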

For batch use-cases, it may be possible for users to manipulate partitions
directly in the system the data source interfaces with, but for streaming
use-cases this is not at all practical.

Two ways I can think of doing this are:
1. Allow "pass-through" commands to the underlying data source
2. Have a generic concept of partitions exposed to the end user via the data
source API and Spark SQL DML.

I'm keen on option 2, but recognise that it's possible there are better
alternatives out there.
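
To give option 2 some shape, here is a rough sketch of what a partition
mixin on the data source API might look like. Every name in it is
hypothetical; nothing like this exists in the current prototype:

```scala
import java.util.{List => JList, Map => JMap}

// Hypothetical mixin for option 2: a data source supporting
// user-driven partition manipulation would implement this trait
// alongside its read/write support interfaces.
trait SupportsPartitionManagement {

  /** Create a partition identified by a map of partition-column values. */
  def createPartition(spec: JMap[String, String]): Unit

  /** Drop the partition identified by `spec`; returns true if it existed. */
  def dropPartition(spec: JMap[String, String]): Boolean

  /** Replace metadata (e.g. location, properties) of an existing partition. */
  def alterPartition(spec: JMap[String, String],
                     properties: JMap[String, String]): Unit

  /** List partition specs, optionally filtered by a partial spec. */
  def listPartitions(partialSpec: JMap[String, String]): JList[JMap[String, String]]
}
```

Spark SQL statements such as `ALTER TABLE ... ADD PARTITION` could then be
routed to these methods for any source implementing the trait, rather than
being limited to Hive-catalog tables.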
