spark-user mailing list archives

From Ramon Tuason <Ramon.Tua...@microsoft.com.INVALID>
Subject [Spark SQL] Failure Scenarios involving JDBC and SQL databases
Date Wed, 09 Jan 2019 03:28:28 GMT
Hi all,

I'm writing a data source that shares similarities with Spark's own JDBC implementation, and
I'd like to ask a question about how Spark handles failure scenarios involving JDBC and SQL
databases. To my understanding, if an executor dies while it's running a task, Spark will
relaunch the executor (or reschedule the task on another one) and retry the task. However, how
does this play out in terms of data integrity when writing through Spark's JDBC data source
API (e.g. df.write.format("jdbc").option(...).save())?
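For concreteness, this is the kind of write I have in mind (the connection URL, table name,
and credentials below are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-write-example").getOrCreate()

    // Any DataFrame; the input path is a placeholder.
    val df = spark.read.parquet("/path/to/input")

    // Each partition of df is written to the target table through its own JDBC connection.
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "target_table")
      .option("user", "username")
      .option("password", "password")
      .mode("append")
      .save()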

In the savePartition function of JdbcUtils.scala<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala>,
we see Spark calling commit() and rollback() on the java.sql.Connection object created from
the database URL and credentials provided by the user (screenshot below). Can someone provide
some guidance on what exactly happens under certain failure scenarios? For example, if an
executor dies right after commit() finishes or before rollback() is called, does Spark try
to re-run the task and write the same data partition again, essentially creating duplicate
committed rows in the database? What happens if the executor dies in the middle of calling
commit() or rollback()?
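
To make the structure concrete, here is roughly the commit/rollback skeleton I'm referring to
(a heavily simplified sketch, not the actual source; the helper names and the column binding
are placeholders):

    // Simplified sketch of the per-partition write pattern: one transaction per partition,
    // commit on success, rollback in the finally block if commit never happened.
    def savePartitionSketch(getConnection: () => java.sql.Connection,
                            rows: Iterator[org.apache.spark.sql.Row],
                            insertSql: String): Unit = {
      val conn = getConnection()
      var committed = false
      try {
        conn.setAutoCommit(false)                  // run the whole partition as one transaction
        val stmt = conn.prepareStatement(insertSql)
        try {
          while (rows.hasNext) {
            val row = rows.next()
            stmt.setString(1, row.getString(0))    // placeholder binding of row values
            stmt.addBatch()
          }
          stmt.executeBatch()
        } finally {
          stmt.close()
        }
        conn.commit()                              // <- executor dies right after this?
        committed = true
      } finally {
        if (!committed) {
          conn.rollback()                          // <- or in the middle of this?
        }
        conn.close()
      }
    }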

Thanks for your help!

[Screenshot: the commit/rollback section of savePartition in JdbcUtils.scala]
