spark-issues mailing list archives

From "Dongjoon Hyun (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-26682) Task attempt ID collision causes lost data
Date Mon, 02 Mar 2020 19:46:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26682:
----------------------------------
    Affects Version/s: 2.1.0

> Task attempt ID collision causes lost data
> ------------------------------------------
>
>                 Key: SPARK-26682
>                 URL: https://issues.apache.org/jira/browse/SPARK-26682
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.3, 2.3.2, 2.4.0
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Blocker
>              Labels: data-loss
>             Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> We recently tracked missing data to a collision in the fake Hadoop task attempt ID created
when using Hadoop OutputCommitters. This is similar to SPARK-24589.
> A stage had one task fail to fetch a shard from a shuffle, causing a FetchFailedException,
and Spark resubmitted the stage. Because only one task was affected, the original stage attempt
continued running tasks that had also been resubmitted in the new stage attempt. As a result, one
task ran two attempts concurrently on the same executor, and both attempts had the same attempt
number because they came from different stage attempts. Because the attempt number was the same,
both attempts used the same temp locations. That caused one attempt to fail because a file path
already existed; on cleanup, that attempt then removed the shared temp location, deleting the
other attempt's data. When the surviving attempt succeeded, it committed partial data.
> The root cause is that both attempts had the same partition and attempt numbers despite
running in different stage attempts, and those two numbers were used to create the Hadoop task
attempt ID on which the temp location is based. The fix is to use Spark's global task attempt
ID, a monotonically increasing counter, instead of the attempt number, because the attempt
number is reused across stage attempts.
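A minimal sketch of the collision described above (hypothetical helper, not Spark's actual code): when the fake Hadoop task attempt ID is derived only from the partition and the per-stage-attempt attempt number, two concurrent attempts of the same task from different stage attempts map to the same ID and hence the same temp location, while a globally unique counter keeps them apart.

```python
def fake_hadoop_attempt_id(job_id: str, partition: int, attempt: int) -> str:
    # Hypothetical illustration of a Hadoop-style TaskAttemptID string,
    # e.g. attempt_<jobId>_m_<partition>_<attempt>. The temp output path
    # for an OutputCommitter is derived from this ID.
    return f"attempt_{job_id}_m_{partition:06d}_{attempt}"

# Stage attempt 0 and resubmitted stage attempt 1 both run partition 7
# with attempt number 0 -> identical fake IDs -> shared temp location.
a = fake_hadoop_attempt_id("20190101", partition=7, attempt=0)
b = fake_hadoop_attempt_id("20190101", partition=7, attempt=0)
assert a == b  # collision: the two attempts clobber each other's files

# With Spark's global task attempt ID (a counter unique across the whole
# application) substituted for the attempt number, the IDs never collide.
c = fake_hadoop_attempt_id("20190101", partition=7, attempt=101)
d = fake_hadoop_attempt_id("20190101", partition=7, attempt=102)
assert c != d  # each attempt gets its own temp location
```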



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


