spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: [VOTE][SPIP] SPARK-22026 data source v2 write path
Date Mon, 16 Oct 2017 16:30:09 GMT
+1

On Sun, Oct 15, 2017 at 11:43 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

> +1
>
> On 10/12/17 20:10, Liwei Lin wrote:
>
> +1 !
>
> Cheers,
> Liwei
>
> On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.khan@gmail.com>
> wrote:
>
>> +1
>>
>> Regards,
>> Vaquar khan
>>
>> On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen.xu@databricks.com> wrote:
>>
>> +1
>>
>> On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsmile@gmail.com> wrote:
>>
>>> +1
>>>
>>> Xiao
>>>
>>> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <rxin@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>> One thing about MetadataSupport: it's a bad idea to call it that
>>>> unless adding new functions to that trait wouldn't break source/binary
>>>> compatibility in the future.
>>>>
>>>>
>>>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0fan@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm adding my own +1 (binding).
>>>>>
>>>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0fan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm going to update the proposal: for the last point, although the
>>>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>>>> mixes data and metadata operations, we are still able to separate
>>>>>> them in the data source write API. We can have a mix-in trait
>>>>>> `MetadataSupport` which has a method `create(options)`, so that data
>>>>>> sources can mix in this trait and provide metadata creation support.
>>>>>> Spark will call this `create` method inside `DataFrameWriter.save` if
>>>>>> the specified data source has it.
>>>>>>
>>>>>> Note that file format data sources can ignore this new trait and
>>>>>> still write data without metadata (they don't have metadata anyway).
>>>>>>
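>>>>>> To make this concrete, here is a minimal Scala sketch of what such a
>>>>>> mix-in could look like. The trait and method names come from this
>>>>>> proposal; the Map-based options container and the JDBC-flavored
>>>>>> example (the class name, the `dbtable` option) are my assumptions,
>>>>>> not the final API:
>>>>>>
>>>>>> // Hypothetical sketch: a data source that can create its metadata
>>>>>> // (e.g., a table) before any data is written. Spark would check for
>>>>>> // this trait inside DataFrameWriter.save and call `create` first.
>>>>>> trait MetadataSupport {
>>>>>>   def create(options: Map[String, String]): Unit
>>>>>> }
>>>>>>
>>>>>> // Example: a JDBC-like source mixing in the trait.
>>>>>> class MyJdbcDataSource extends MetadataSupport {
>>>>>>   override def create(options: Map[String, String]): Unit = {
>>>>>>     val table = options.getOrElse("dbtable", sys.error("dbtable is required"))
>>>>>>     // e.g., issue CREATE TABLE IF NOT EXISTS against the target database
>>>>>>     println(s"ensuring table $table exists")
>>>>>>   }
>>>>>> }
>>>>>>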
>>>>>> With this updated proposal, I'm calling a new vote for the data
>>>>>> source v2 write path.
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following
>>>>>> technical reasons.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0fan@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Now that we have merged the infrastructure of the data source v2
>>>>>>> read path and had some discussion about the write path, I'm sending
>>>>>>> this email to call a vote for the Data Source v2 write path.
>>>>>>>
>>>>>>> The full document of the Data Source API V2 is:
>>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>
>>>>>>> The ready-for-review PR that implements the basic infrastructure
>>>>>>> for the write path:
>>>>>>> https://github.com/apache/spark/pull/19269
>>>>>>>
>>>>>>>
>>>>>>> The Data Source V1 write path asks implementations to write a
>>>>>>> DataFrame directly, which is painful:
>>>>>>> 1. Exposing an upper-level API like DataFrame to the Data Source
>>>>>>> API is not good for maintenance.
>>>>>>> 2. Data sources may need to preprocess the input data before
>>>>>>> writing, e.g., cluster/sort the input by some columns. It's better
>>>>>>> to do the preprocessing in Spark instead of in the data source.
>>>>>>> 3. Data sources need to take care of transactions themselves, which
>>>>>>> is hard. And different data sources may come up with very similar
>>>>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>>>>
>>>>>>> To solve these pain points, I'm proposing the data source v2
>>>>>>> writing framework, which is very similar to the reading framework,
>>>>>>> i.e., WriteSupport -> DataSourceV2Writer -> DataWriterFactory ->
>>>>>>> DataWriter.
>>>>>>>
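>>>>>>> As a rough Scala sketch (the four names above come from this
>>>>>>> proposal; the method signatures, the placeholder Row and
>>>>>>> WriterCommitMessage types, and the exact shape of the commit/abort
>>>>>>> hooks described below are assumptions, not the final API):
>>>>>>>
>>>>>>> // Placeholder types to keep the sketch self-contained.
>>>>>>> trait Row
>>>>>>> trait WriterCommitMessage
>>>>>>>
>>>>>>> trait WriteSupport {
>>>>>>>   // Entry point: create one writer per write job.
>>>>>>>   def createWriter(options: Map[String, String]): DataSourceV2Writer
>>>>>>> }
>>>>>>>
>>>>>>> trait DataSourceV2Writer {
>>>>>>>   // Runs on the driver; the factory is serialized to executors.
>>>>>>>   def createWriterFactory(): DataWriterFactory
>>>>>>>   def commit(messages: Seq[WriterCommitMessage]): Unit // job commit
>>>>>>>   def abort(messages: Seq[WriterCommitMessage]): Unit  // job abort
>>>>>>> }
>>>>>>>
>>>>>>> trait DataWriterFactory {
>>>>>>>   // Runs on executors; one DataWriter per task attempt.
>>>>>>>   def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
>>>>>>> }
>>>>>>>
>>>>>>> trait DataWriter {
>>>>>>>   def write(row: Row): Unit
>>>>>>>   def commit(): WriterCommitMessage // task commit
>>>>>>>   def abort(): Unit                 // task abort
>>>>>>> }
>>>>>>>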
>>>>>>> The Data Source V2 write path follows the existing
>>>>>>> FileCommitProtocol and has task/job-level commit/abort, so that data
>>>>>>> sources can implement transactions more easily.
>>>>>>>
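>>>>>>> Using the interfaces sketched above, the driver-side flow might look
>>>>>>> like this; the sequencing mirrors FileCommitProtocol, but the exact
>>>>>>> flow is an assumption, and real tasks run in parallel on executors:
>>>>>>>
>>>>>>> def runWriteJob(writer: DataSourceV2Writer, data: Seq[Seq[Row]]): Unit = {
>>>>>>>   val factory = writer.createWriterFactory()
>>>>>>>   try {
>>>>>>>     val messages = data.zipWithIndex.map { case (partition, id) =>
>>>>>>>       val taskWriter = factory.createWriter(id, attemptNumber = 0)
>>>>>>>       try {
>>>>>>>         partition.foreach(taskWriter.write)
>>>>>>>         taskWriter.commit() // task-level commit
>>>>>>>       } catch {
>>>>>>>         case e: Throwable => taskWriter.abort(); throw e // task abort
>>>>>>>       }
>>>>>>>     }
>>>>>>>     writer.commit(messages) // job-level commit: all tasks succeeded
>>>>>>>   } catch {
>>>>>>>     case e: Throwable => writer.abort(Seq.empty); throw e // job abort
>>>>>>>   }
>>>>>>> }
>>>>>>>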
>>>>>>> We can create a mix-in trait for DataSourceV2Writer to specify the
>>>>>>> requirements on the input data, like clustering and ordering.
>>>>>>>
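>>>>>>> A hypothetical sketch of such a mix-in (the trait and method names
>>>>>>> here are mine; only the clustering/ordering idea comes from the
>>>>>>> proposal):
>>>>>>>
>>>>>>> trait SupportsWriteRequirements {
>>>>>>>   // Columns the input should be clustered (repartitioned) by
>>>>>>>   // before it reaches the DataWriters.
>>>>>>>   def requiredClustering: Seq[String]
>>>>>>>   // Columns each partition of the input should be sorted by.
>>>>>>>   def requiredOrdering: Seq[String]
>>>>>>> }
>>>>>>>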
>>>>>>> Spark provides a very simple protocol for users to connect to data
>>>>>>> sources. A common way to write a DataFrame to a data source is
>>>>>>> `df.write.format(...).option(...).mode(...).save()`.
>>>>>>> Spark passes the options and save mode to the data source and
>>>>>>> schedules the write job on the input data. The data source should
>>>>>>> take care of the metadata, e.g., the JDBC data source can create the
>>>>>>> table if it doesn't exist, or fail the job and ask users to create
>>>>>>> the table in the corresponding database first. Data sources can
>>>>>>> define options that let users carry metadata information like
>>>>>>> partitioning/bucketing.
>>>>>>>
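>>>>>>> In user code that protocol looks like the following; `df` is an
>>>>>>> existing DataFrame, the source and option values are illustrative,
>>>>>>> and the partitioning option name is hypothetical:
>>>>>>>
>>>>>>> df.write
>>>>>>>   .format("com.example.MyJdbcDataSource") // hypothetical v2 source
>>>>>>>   .option("url", "jdbc:postgresql://localhost/test")
>>>>>>>   .option("dbtable", "events")
>>>>>>>   .option("partitionColumns", "date") // metadata carried via options
>>>>>>>   .mode("errorifexists") // fail if the table already exists
>>>>>>>   .save()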
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with
your
>>>>>>> vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>> technical reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>
>
