spark-dev mailing list archives

From Shiv Prashant Sood <shivprash...@gmail.com>
Subject Re: DataSourceV2 : Transactional Write support
Date Mon, 05 Aug 2019 23:57:05 GMT
Thanks all for the clarification.

Regards,
Shiv

On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue <rblue@netflix.com.invalid> wrote:

> > What you could try instead is intermediate output: inserting into a
> temporary table in executors, and moving the inserted records to the final
> table in the driver (must be atomic).
>
> I think that this is the approach that other systems (maybe sqoop?) have
> taken. Insert into independent temporary tables, which can be done quickly.
> Then for the final commit operation, union and insert into the final table.
> In a lot of cases, JDBC databases can do that quickly as well because the
> data is already on disk and just needs to be added to the final table.
>
> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>
>> I asked a similar question about end-to-end exactly-once with Kafka, and
>> you're correct that distributed transactions are not supported.
>> Introducing distributed transactions such as "two-phase commit" would
>> require huge changes to the Spark codebase, and the feedback was not
>> positive.
>>
>> What you could try instead is intermediate output: inserting into a
>> temporary table in executors, and moving the inserted records to the final
>> table in the driver (must be atomic).
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood <shivprashant@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I understood that DataSourceV2 supports transactional writes and wanted
>>> to implement that in the JDBC DataSource V2 connector (PR#25211
>>> <https://github.com/apache/spark/pull/25211>).
>>>
>>> I don't see how this is feasible for a JDBC-based connector. The framework
>>> suggests that each EXECUTOR send a commit message to the DRIVER, and that
>>> the actual commit only be done by the DRIVER after receiving all commit
>>> confirmations. This will not work for JDBC, as commits have to happen on
>>> the JDBC Connection, which is maintained by the EXECUTORS, and a JDBC
>>> Connection is not serializable, so it cannot be sent to the DRIVER.
>>>
>>> Am I right in thinking that this cannot be supported for JDBC? My goal
>>> is to either fully write or roll back the dataframe write operation.
>>>
>>> Thanks in advance for your help.
>>>
>>> Regards,
>>> Shiv
>>>
>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
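
The staging-table approach described in the thread can be sketched roughly as follows. This is a minimal illustration, not code from Spark or from PR#25211. It assumes SQLite (via Python's built-in sqlite3 module) as a stand-in for a JDBC database, and it uses a single connection where a real job would have one connection per executor plus one on the driver. The names final_table, staging_part_N, write_partition, commit_job, abort_job and run_job are all hypothetical.

```python
import sqlite3


def write_partition(conn, partition_id, rows):
    """Executor-side step: write one partition into its own staging table."""
    staging = f"staging_part_{partition_id}"  # hypothetical naming scheme
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (id INTEGER, value TEXT)")
    with conn:  # commits on success, rolls back if the insert raises
        conn.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
    # Only the table name travels to the driver, not the connection.
    return staging


def commit_job(conn, staging_tables):
    """Driver-side step: union the staged rows into the final table atomically."""
    union = " UNION ALL ".join(
        f"SELECT id, value FROM {t}" for t in staging_tables
    )
    with conn:  # one transaction: either all staged rows land or none do
        conn.execute(f"INSERT INTO final_table (id, value) {union}")
        for t in staging_tables:
            conn.execute(f"DROP TABLE {t}")


def abort_job(conn, partition_count):
    """Driver-side cleanup: drop whatever staging tables were created."""
    for pid in range(partition_count):
        conn.execute(f"DROP TABLE IF EXISTS staging_part_{pid}")


def run_job(conn, partitions):
    """Stage every partition, then commit; abort if any step fails."""
    try:
        staged = [
            write_partition(conn, pid, rows)
            for pid, rows in enumerate(partitions)
        ]
        commit_job(conn, staged)
        return True
    except sqlite3.Error:
        abort_job(conn, len(partitions))
        return False
```

In this sketch the final table is only touched inside commit_job, so a failure in any partition leaves it unchanged. Whether the final insert is fast and atomic on a given JDBC database depends on that database's transaction support.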
