spark-dev mailing list archives

From "Dilip Biswal" <dbis...@us.ibm.com>
Subject Re: DataSourceWriter V2 Api questions
Date Mon, 10 Sep 2018 18:48:25 GMT
<div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10.5pt"
><div dir="ltr" ><span style="color: rgb(18, 18, 18); font-family: &quot;Helvetica
Neue&quot;, Helvetica, Arial, &quot;Lucida Grande&quot;, sans-serif; font-size:
12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight:
400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display:
inline !important; float: none;" >This is a pretty big challenge in general for data sources
-- for the vast majority of data stores, the boundary of a transaction is per client. That
is, you can't have two clients doing writes and coordinating a single transaction. That's
certainly the case for almost all relational databases. Spark, on the other hand, will have
multiple clients (consider each task a client) writing to the same underlying data store.</span></div>
<div dir="ltr" >&nbsp;</div>
<div dir="ltr" ><span style="color: rgb(18, 18, 18); font-family: &quot;Helvetica
Neue&quot;, Helvetica, Arial, &quot;Lucida Grande&quot;, sans-serif; font-size:
12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight:
400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform:
none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display:
inline !important; float: none;" >DB&gt;&gt; Perhaps we can explore a two-phase commit
protocol (as in XA) for this? Not sure how easy it would be to implement, though :-)</span></div>
<div dir="ltr" >&nbsp;</div>
<div dir="ltr" >Regards,<br>Dilip Biswal<br>Tel: 408-463-4980<br>dbiswal@us.ibm.com</div>
<div dir="ltr" >&nbsp;</div>
<div dir="ltr" >&nbsp;</div>
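For illustration only, here is a minimal in-memory sketch of the two-phase commit idea suggested above. It is not the Spark API and not an XA implementation; the Participant class and the two_phase_commit function are hypothetical names invented for this example. Each participant stands in for one task-side writer, and the coordinator stands in for the driver.

```python
class Participant:
    """Hypothetical task-side writer taking part in a two-phase commit."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulates a task that cannot prepare
        self.pending = []     # rows written but not yet visible
        self.committed = []   # rows visible to readers
        self.state = "active"

    def write(self, row):
        # Writes only buffer data; nothing is visible yet.
        self.pending.append(row)

    def prepare(self):
        # Phase 1: vote on whether this participant can commit.
        if self.fail_on_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"
        return True

    def commit(self):
        # Phase 2: make the buffered rows visible.
        assert self.state == "prepared"
        self.committed.extend(self.pending)
        self.pending.clear()
        self.state = "committed"

    def rollback(self):
        # Phase 2 (failure path): discard the buffered rows.
        self.pending.clear()
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator (driver-side in this sketch): commit all or none."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A real data store would also need durable prepare records and recovery when the coordinator fails between the two phases, which this sketch leaves out.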
<blockquote data-history-content-modified="1" dir="ltr" style="border-left:solid #aaaaaa
2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" >----- Original
message -----<br>From: Reynold Xin &lt;rxin@databricks.com&gt;<br>To:
Ryan Blue &lt;rblue@netflix.com&gt;<br>Cc: ross.lawley@gmail.com, dev &lt;dev@spark.apache.org&gt;<br>Subject:
Re: DataSourceWriter V2 Api questions<br>Date: Mon, Sep 10, 2018 10:26 AM<br>&nbsp;
<div dir="ltr" >I don't think the problem is just whether we have a starting point for
write. As a matter of fact there's always a starting point for write, whether it is explicit
or implicit.
<div>&nbsp;</div>
<div>This is a pretty big challenge in general for data sources -- for the vast majority
of data stores, the boundary of a transaction is per client. That is, you can't have two clients
doing writes and coordinating a single transaction. That's certainly the case for almost all
relational databases. Spark, on the other hand, will have multiple clients (consider each
task a client) writing to the same underlying data store.</div></div>&nbsp;

<div><div dir="ltr" >On Mon, Sep 10, 2018 at 10:19 AM Ryan Blue &lt;<a
href="mailto:rblue@netflix.com" target="_blank" >rblue@netflix.com</a>&gt; wrote:</div>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex" ><div
dir="ltr" >Ross, I think the intent is to create a single transaction on the driver, write
as part of it in each task, and then commit the transaction once the tasks complete. Is that
possible in your implementation?
<div>&nbsp;</div>
<div>I think that part of this is made more difficult by not having a clear starting
point for a write, which we are fixing in the redesign of the v2 API. That will have a method
that creates a Write to track the operation. That can create your transaction when it is created
and commit the transaction when commit is called on it.</div>
<div>&nbsp;</div>
<div>rb</div></div>&nbsp;

<div><div dir="ltr" >On Mon, Sep 10, 2018 at 9:05 AM Reynold Xin &lt;<a
href="mailto:rxin@databricks.com" target="_blank" >rxin@databricks.com</a>&gt;
wrote:</div>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex" ><div
dir="ltr" >Typically people do it via transactions, or staging tables.
<div>&nbsp;</div></div>&nbsp;

<div><div dir="ltr" >On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley &lt;<a
href="mailto:ross.lawley@gmail.com" target="_blank" >ross.lawley@gmail.com</a>&gt;
wrote:</div>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex" ><div
dir="ltr" ><div dir="ltr" ><div>Hi all,</div>
<div>&nbsp;</div>
<div>I've been prototyping an implementation of the DataSource V2 writer for the MongoDB
Spark Connector and I have a couple of questions about how it is intended to be used with database
systems. According to the Javadoc for DataWriter.commit():</div>
<div>&nbsp;</div>
<div><i>"this method should still "hide" the written data and ask the DataSourceWriter
at driver side to do the final commit via WriterCommitMessage"</i></div>
<div>&nbsp;</div>
<div>Although MongoDB now has transactions, it doesn't have a way to "hide" the data
once it has been written. So as soon as the DataWriter has committed the data, it has been
inserted/updated in the collection and is discoverable - thereby breaking the documented contract.</div>
<div>&nbsp;</div>
<div>I was wondering how other database systems plan to implement this API and meet
the contract as per the Javadoc?</div>
<div>&nbsp;</div>
<div>Many thanks</div>
<div>&nbsp;</div>
<div>Ross</div></div></div></blockquote></div></blockquote></div>&nbsp;

<div>&nbsp;</div>--

<div data-smartmail="gmail_signature" dir="ltr" ><div dir="ltr" ><div><div
dir="ltr" >Ryan Blue
<div>Software Engineer</div>
<div><span style="font-size:12.8px" >Netflix</span></div></div></div></div></div></blockquote></div></blockquote>
<div dir="ltr" >&nbsp;</div></div><BR>
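The staging-table approach mentioned in the thread, combined with a single commit on the driver once all tasks complete, can be sketched roughly as follows. This is a minimal illustration using Python's sqlite3 module with one shared connection, not the Spark DataSource V2 API. The table names (staging, target) and the function names (open_store, task_write, driver_commit, driver_abort) are assumptions made up for the example, and the returned dictionary only loosely plays the role of a WriterCommitMessage.

```python
import sqlite3


def open_store():
    """Create an in-memory store with a staging table and a target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (task_id INTEGER, value TEXT)")
    conn.execute("CREATE TABLE target (value TEXT)")
    conn.commit()
    return conn


def task_write(conn, task_id, rows):
    """Task side: write rows to the staging table only, then report back."""
    with conn:  # commits this task's own transaction on success
        conn.executemany(
            "INSERT INTO staging (task_id, value) VALUES (?, ?)",
            [(task_id, value) for value in rows],
        )
    # Stand-in for the commit message a task sends to the driver.
    return {"task_id": task_id, "count": len(rows)}


def driver_commit(conn, messages):
    """Driver side: publish all staged rows in one transaction."""
    with conn:
        for message in messages:
            conn.execute(
                "INSERT INTO target (value) "
                "SELECT value FROM staging WHERE task_id = ?",
                (message["task_id"],),
            )
        conn.execute("DELETE FROM staging")


def driver_abort(conn):
    """Driver side: discard everything the tasks staged."""
    with conn:
        conn.execute("DELETE FROM staging")
```

Readers that query only the target table never see staged rows, so each task can commit its own per-client transaction while the job as a whole becomes visible in one step on the driver.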



