spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: DataSourceV2 : Transactional Write support
Date Tue, 06 Aug 2019 02:39:38 GMT
I agree with the temp table approach. One idea is: maybe we only need one
temp table, and each task writes to this temp table. At the end we read the
data from the temp table and write it to the target table. AFAIK JDBC can
handle concurrent table writing very well, and it's better than creating
thousands of temp tables for one write job(assume the input RDD has
thousands of partitions).

On Tue, Aug 6, 2019 at 7:57 AM Shiv Prashant Sood <shivprashant@gmail.com>
wrote:

> Thanks all for the clarification.
>
> Regards,
> Shiv
>
> On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
>> > What you could try instead is intermediate output: inserting into
>> temporal table in executors, and move inserted records to the final table
>> in driver (must be atomic)
>>
>> I think that this is the approach that other systems (maybe sqoop?) have
>> taken. Insert into independent temporary tables, which can be done quickly.
>> Then for the final commit operation, union and insert into the final table.
>> In a lot of cases, JDBC databases can do that quickly as well because the
>> data is already on disk and just needs to added to the final table.
>>
>> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>>
>>> I asked similar question for end-to-end exactly-once with Kafka, and
>>> you're correct distributed transaction is not supported. Introducing
>>> distributed transaction like "two-phase commit" requires huge change on
>>> Spark codebase and the feedback was not positive.
>>>
>>> What you could try instead is intermediate output: inserting into
>>> temporal table in executors, and move inserted records to the final table
>>> in driver (must be atomic).
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood <
>>> shivprashant@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> I understood that DataSourceV2 supports Transactional write and wanted
>>>> to  implement that in JDBC DataSource V2 connector ( PR#25211
>>>> <https://github.com/apache/spark/pull/25211> ).
>>>>
>>>> Don't see how this is feasible for JDBC based connector.  The FW
>>>> suggest that EXECUTOR send a commit message  to DRIVER, and actual
>>>> commit should only be done by DRIVER after receiving all commit
>>>> confirmations. This will not work for JDBC  as commits have to happen on
>>>> the JDBC Connection which is maintained by the EXECUTORS and
>>>> JDBCConnection  is not serializable that it can be sent to the DRIVER.
>>>>
>>>> Am i right in thinking that this cannot be supported for JDBC? My goal
>>>> is to either fully write or roll back the dataframe write operation.
>>>>
>>>> Thanks in advance for your help.
>>>>
>>>> Regards,
>>>> Shiv
>>>>
>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Mime
View raw message