spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: [VOTE][SPIP] SPARK-22026 data source v2 write path
Date Mon, 16 Oct 2017 06:43:20 GMT
+1


On 10/12/17 20:10, Liwei Lin wrote:
> +1 !
>
> Cheers,
> Liwei
>
> On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.khan@gmail.com 
> <mailto:vaquar.khan@gmail.com>> wrote:
>
>     +1
>
>     Regards,
>     Vaquar khan
>
>     On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen.xu@databricks.com
>     <mailto:weichen.xu@databricks.com>> wrote:
>
>         +1
>
>         On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li
>         <gatorsmile@gmail.com <mailto:gatorsmile@gmail.com>> wrote:
>
>             +1
>
>             Xiao
>
>             On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin
>             <rxin@databricks.com <mailto:rxin@databricks.com>> wrote:
>
>                 +1
>
>                 One thing with MetadataSupport: it's a bad idea to
>                 call it that unless adding new functions to that trait
>                 wouldn't break source/binary compatibility in the future.
>
>
>                 On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan
>                 <cloud0fan@gmail.com <mailto:cloud0fan@gmail.com>> wrote:
>
>                     I'm adding my own +1 (binding).
>
>                     On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan
>                     <cloud0fan@gmail.com <mailto:cloud0fan@gmail.com>>
>                     wrote:
>
>                         I'm going to update the proposal: for the last
>                         point, although the user-facing API
>                         (`df.write.format(...).option(...).mode(...).save()`)
>                         mixes data and metadata operations, we are
>                         still able to separate them in the data source
>                         write API. We can have a mix-in trait
>                         `MetadataSupport` with a method
>                         `create(options)`, so that data sources can
>                         mix in this trait to provide metadata
>                         creation support. Spark will call this
>                         `create` method inside `DataFrameWriter.save`
>                         if the specified data source implements it.
>
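>                         As a rough sketch (the trait name and the
>                         `create(options)` method are from this
>                         proposal, but the exact signature below is
>                         only illustrative):
>
>                         // Sketch only: exact signature to be
>                         // finalized in the PR.
>                         trait MetadataSupport {
>                           // Called by DataFrameWriter.save() before
>                           // the write job runs, with the
>                           // user-specified options, so the source
>                           // can create its metadata (e.g. a JDBC
>                           // table) up front.
>                           def create(options: Map[String, String]): Unit
>                         }
>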
>                         Note that file format data sources can ignore
>                         this new trait and still write data without
>                         metadata (they don't have metadata anyway).
>
>                         With this updated proposal, I'm calling a new
>                         vote for the data source v2 write path.
>
>                         The vote will be up for the next 72 hours.
>                         Please reply with your vote:
>
>                         +1: Yeah, let's go forward and implement the SPIP.
>                         +0: Don't really care.
>                         -1: I don't think this is a good idea because
>                         of the following technical reasons.
>
>                         Thanks!
>
>                         On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan
>                         <cloud0fan@gmail.com
>                         <mailto:cloud0fan@gmail.com>> wrote:
>
>                             Hi all,
>
>                             After we merge the infrastructure of data
>                             source v2 read path, and have some
>                             discussion for the write path, now I'm
>                             sending this email to call a vote for Data
>                             Source v2 write path.
>
>                             The full document of the Data Source API
>                             V2 is:
>                             https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>                             <https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit>
>
>                             The ready-for-review PR that implements
>                             the basic infrastructure for the write path:
>                             https://github.com/apache/spark/pull/19269
>                             <https://github.com/apache/spark/pull/19269>
>
>
>                             The Data Source V1 write path asks
>                             implementations to write a DataFrame
>                             directly, which is painful:
>                             1. Exposing an upper-level API like
>                             DataFrame to the Data Source API is bad
>                             for maintenance.
>                             2. Data sources may need to preprocess the
>                             input data before writing, e.g.,
>                             cluster/sort the input by some columns.
>                             It's better to do this preprocessing in
>                             Spark than in each data source.
>                             3. Data sources need to handle
>                             transactions themselves, which is hard,
>                             and different data sources may end up with
>                             very similar transaction handling, leading
>                             to a lot of duplicated code.
>
>                             To solve these pain points, I'm proposing
>                             a data source v2 writing framework that is
>                             very similar to the reading framework,
>                             i.e., WriteSupport -> DataSourceV2Writer
>                             -> DataWriterFactory -> DataWriter, as
>                             sketched below.
>
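>                             A simplified sketch of the chain (Scala;
>                             see the PR for the real signatures):
>
>                             import org.apache.spark.sql.{Row, SaveMode}
>                             import org.apache.spark.sql.types.StructType
>
>                             trait WriteSupport {
>                               // One writer per write job, given the
>                               // input schema, save mode and options.
>                               def createWriter(
>                                   schema: StructType,
>                                   mode: SaveMode,
>                                   options: Map[String, String]): DataSourceV2Writer
>                             }
>
>                             trait DataSourceV2Writer {
>                               def createWriterFactory(): DataWriterFactory
>                               // Job-level commit/abort, called by
>                               // Spark on the driver.
>                               def commit(messages: Seq[WriterCommitMessage]): Unit
>                               def abort(messages: Seq[WriterCommitMessage]): Unit
>                             }
>
>                             trait DataWriterFactory extends Serializable {
>                               // Created on the driver, shipped to
>                               // executors: one DataWriter per task.
>                               def createWriter(
>                                   partitionId: Int,
>                                   attemptNumber: Int): DataWriter
>                             }
>
>                             trait DataWriter {
>                               def write(row: Row): Unit
>                               // Task-level commit/abort.
>                               def commit(): WriterCommitMessage
>                               def abort(): Unit
>                             }
>
>                             trait WriterCommitMessage extends Serializable
>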
>                             The Data Source V2 write path follows the
>                             existing FileCommitProtocol and has
>                             task/job-level commit/abort, so that data
>                             sources can implement transactions more
>                             easily.
>
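>                             Concretely, Spark could drive the protocol
>                             roughly like this (driver-side sketch,
>                             assuming the traits above; job-level abort
>                             and retry handling elided):
>
>                             import org.apache.spark.rdd.RDD
>                             import org.apache.spark.sql.Row
>
>                             def runWriteJob(
>                                 writer: DataSourceV2Writer,
>                                 data: RDD[Row]): Unit = {
>                               val factory = writer.createWriterFactory()
>                               val messages = data.mapPartitionsWithIndex { (pid, rows) =>
>                                 // Runs on executors, once per task.
>                                 val w = factory.createWriter(pid, attemptNumber = 0)
>                                 try {
>                                   rows.foreach(w.write)
>                                   Iterator.single(w.commit())  // task-level commit
>                                 } catch {
>                                   case t: Throwable =>
>                                     w.abort()                  // task-level abort
>                                     throw t
>                                 }
>                               }.collect()
>                               // Job-level commit, only after all tasks
>                               // committed; on failure Spark would call
>                               // writer.abort(...) instead.
>                               writer.commit(messages)
>                             }
>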
>                             We can create a mix-in trait for
>                             DataSourceV2Writer to specify requirements
>                             on the input data, like clustering and
>                             ordering.
>
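>                             For example (the trait name and methods
>                             here are purely illustrative, not part of
>                             the proposal):
>
>                             // Illustrative only: declares what the
>                             // writer needs from its input, so Spark
>                             // can do the shuffling/sorting itself.
>                             trait SupportsInputRequirements {
>                               this: DataSourceV2Writer =>
>                               // Columns the input should be clustered
>                               // by (rows with equal values end up in
>                               // the same task), e.g. for bucketing.
>                               def requiredClustering: Seq[String]
>                               // Per-task sort order to apply before
>                               // rows reach the DataWriter.
>                               def requiredOrdering: Seq[String]
>                             }
>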
>                             Spark provides a very simple protocol for
>                             users to connect to data sources. A common
>                             way to write a dataframe to a data source is
>                             `df.write.format(...).option(...).mode(...).save()`.
>                             Spark passes the options and save mode to
>                             the data source and schedules the write job
>                             on the input data. The data source
>                             should take care of the metadata, e.g.,
>                             the JDBC data source can create the table
>                             if it doesn't exist, or fail the job and
>                             ask users to create the table in the
>                             corresponding database first. Data sources
>                             can also define options that let users carry
>                             metadata information like
>                             partitioning/bucketing.
>
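>                             For example, with the JDBC source (option
>                             values here are illustrative):
>
>                             // df is an existing DataFrame.
>                             df.write
>                               .format("jdbc")
>                               .option("url", "jdbc:postgresql://localhost/test")
>                               .option("dbtable", "events")
>                               .mode("error")  // ErrorIfExists: fail if the table already has data
>                               .save()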
>
>                             The vote will be up for the next 72 hours.
>                             Please reply with your vote:
>
>                             +1: Yeah, let's go forward and implement
>                             the SPIP.
>                             +0: Don't really care.
>                             -1: I don't think this is a good idea
>                             because of the following technical reasons.
>
>                             Thanks!