spark-user mailing list archives

From Sumit Khanna <>
Subject hdfs persist rollbacks when spark job is killed
Date Mon, 08 Aug 2016 06:35:34 GMT

the use case is as follows:

say I am inserting 200K rows with dataframe.write.format("parquet") etc.
(a basic write-to-HDFS command), but for some rhyme or reason my job gets
killed midway through the run, so that only 100K of the rows have actually
been written when it dies.

the twist is that I might actually be upserting, and even in append-only
cases the delta data being written in a given run may span several
partitions.

Now what I am looking for is a way to roll the changes back: the batch
insert should be all or nothing, and even if the write is partitioned, it
must be atomic per partition / unit of insertion.
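for context, the all-or-nothing behavior I am after is essentially a
staging-directory pattern: write everything somewhere invisible to readers,
then promote it with a single atomic directory rename (HDFS rename is atomic
in the same way). a minimal sketch with plain local files, using a
hypothetical helper name:

```python
import os
import shutil
import tempfile

def write_all_or_nothing(rows, final_dir):
    """Hypothetical helper: write rows into a staging directory first, then
    promote the whole directory with one atomic rename. If the job is killed
    mid-write, final_dir is never created, and the stale staging directory
    can simply be deleted on the next attempt."""
    staging = final_dir + "._staging"
    if os.path.exists(staging):
        shutil.rmtree(staging)  # clean up a previous failed attempt
    os.makedirs(staging)
    for i, row in enumerate(rows):
        with open(os.path.join(staging, f"part-{i:05d}"), "w") as f:
            f.write(row + "\n")
    # atomic promotion: readers see either all rows or none of them
    os.rename(staging, final_dir)

rows = [f"row-{i}" for i in range(5)]
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "table")
    write_all_or_nothing(rows, target)
    assert len(os.listdir(target)) == 5
```

note this only gives atomicity per target directory; for an upsert whose
delta spans many partitions, every touched partition would need its own
stage-and-rename, which is exactly the multi-partition coordination I am
unsure how to get from Spark.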

Kindly help.

