spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thakrar, Jayesh" <>
Subject Re: [Discuss] Datasource v2 support for manipulating partitions
Date Mon, 17 Sep 2018 01:38:36 GMT
I am not involved with the design or development of the V2 API - so these could be naïve comments/thoughts.
Just as dataset is to abstract away from RDD, which otherwise required a little more intimate
knowledge about Spark internals, I am guessing the absence of partition operations is either
due to no current need for it or the need to abstract that away from the API user/programmer.
Ofcourse, the other thing about dataset is that it comes with schema - closely binding the
two together.

However for situations where a deeper understanding of the datasource is necessary to read/write
data, then that logic can potentially be embedded into the classes that implement DataSourceReader/DataSourceWriter
or DataReader/DataWriter.

E.g. if you are writing data and you need to dynamically create partitions on the fly as you
write data, then the DataSourceReader can gather the current list of partitions and pass it
on to DataWriter via DataWriterFactory. The DataWriter can consult that list and if a row
is encountered that does not exist then create it (again note that the partition creation
operation needs to be idempotent OR the DataWriter needs to check for the partition before
trying to create as it may have been already created by another DataWriter).

As for partition add/list/drop/alter, I don't think that concept/notion applies to all datasources
(e.g. filesystem).
Also, the concept of a Spark partition may not translate into the underlying datasource partition.

At the same time I did see a discussion thread on catalog operations for V2 API - although,
probably Spark partitions do not map one-to-one to the underlying partitions.

Probably a good place to introduce partition info is to add a method/object called "meta"
to a dataset and allow the datasource to describe itself (e.g. table permissions, table partitions
and specs, datasource info (e.g. cluster), etc.).

E.g. something like this

With just meta method
dataset.meta = {optional datasource specific info}

Or with meta as an intermediate object with several operations

However, if you are look 

On 9/16/18, 1:24 AM, "tigerquoll" <> wrote:

    I've been following the development of the new data source abstraction with
    keen interest.  One of the issues that has occurred to me as I sat down and
    planned how I would implement a data source is how I would support
    manipulating partitions.
    My reading of the current prototype is that Data source v2 APIs expose
    enough of a concept of a partition to support communicating record
    distribution particulars to catalyst, but does not represent partitions as a
    concept that the end user of the data sources can manipulate.
    The end users of data sources need to be able to add/drop/modify and list
    partitions. For example, many systems require partitions to be created
    before records are added to them.  
    For batch use-cases, it may be possible for users to manipulate partitions
    from within the environment that the data source interfaces to, but for
    streaming use-cases, this is not at all practical.
    Two ways I can think of doing this are:
    1. Allow "pass-through" commands to the underlying data source
    2. Have a generic concept of partitions exposed to the end user via the data
    source API and Spark SQL DML.
    I'm keen for option 2 but recognise that its possible there are better
    alternatives out there.
    Sent from:

View raw message