spark-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-2532) Fix issues with consolidated shuffle
Date Fri, 01 Aug 2014 20:59:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Matei Zaharia updated SPARK-2532:
---------------------------------

         Description: 
Will file a PR with the changes as soon as the merge is done (the earlier merge unfortunately became outdated within 2 weeks :) ).

Consolidated shuffle is broken in multiple ways in Spark:

a) Task failure(s) can cause the state to become inconsistent.

b) Multiple reverts, or a combination of close/revert/close, can cause the state to become
inconsistent (as part of exception/error handling).

c) Some of the APIs in the block writer cause implementation issues. For example, a revert is
always followed by a close, but the implementation tries to keep them separate, creating
surface area for errors.

d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively
written to: a segment's length is computed by subtracting its offset from the next offset (or
from the file length, if it is the last segment). The latter fails when a fetch happens in
parallel with a write.
Note that this happens even when there are no task failures of any kind!
This usually results in stream corruption or decompression errors.
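To illustrate point (d), here is a minimal sketch (not the actual Spark code; the class and method names `OffsetLengthSketch` / `segmentLength` are hypothetical) of how a reader that derives segment lengths from recorded offsets goes wrong when the last segment's length is taken from the size of a file that is still being appended to:

```java
// Hypothetical sketch of the offset arithmetic described in (d).
// offsets[i] = byte offset where map output i begins in the consolidated file.
public class OffsetLengthSketch {
    // Length of segment i = offsets[i+1] - offsets[i]; for the LAST segment
    // the reader falls back to fileLength - offsets[i].
    static long segmentLength(long[] offsets, int index, long fileLength) {
        if (index < offsets.length - 1) {
            return offsets[index + 1] - offsets[index];
        }
        return fileLength - offsets[index];
    }

    public static void main(String[] args) {
        long[] offsets = {0L, 100L, 250L};
        // With a fully written 400-byte file, the last segment is 150 bytes.
        System.out.println(segmentLength(offsets, 2, 400L)); // 150
        // If a writer is still appending when the fetch happens, the observed
        // file length includes in-flight bytes: the computed last-segment
        // length is too large, and the reader decompresses garbage.
        System.out.println(segmentLength(offsets, 2, 512L)); // 262 (wrong)
    }
}
```

The non-last segments are safe because both offsets were recorded after their data was fully written; only the last segment depends on the live file length, which is why concurrent fetch-during-write corrupts it even with no task failures.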


  was:

Will file a PR with the changes as soon as the merge is done (the earlier merge unfortunately became outdated within 2 weeks :) ).

Consolidated shuffle is broken in multiple ways in Spark:

a) Task failure(s) can cause the state to become inconsistent.

b) Multiple reverts, or a combination of close/revert/close, can cause the state to become
inconsistent (as part of exception/error handling).

c) Some of the APIs in the block writer cause implementation issues. For example, a revert is
always followed by a close, but the implementation tries to keep them separate, creating
surface area for errors.

d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively
written to: a segment's length is computed by subtracting its offset from the next offset (or
from the file length, if it is the last segment). The latter fails when a fetch happens in
parallel with a write.
Note that this happens even when there are no task failures of any kind!
This usually results in stream corruption or decompression errors.


    Target Version/s: 1.1.0

> Fix issues with consolidated shuffle
> ------------------------------------
>
>                 Key: SPARK-2532
>                 URL: https://issues.apache.org/jira/browse/SPARK-2532
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>         Environment: All
>            Reporter: Mridul Muralidharan
>            Assignee: Mridul Muralidharan
>            Priority: Critical
>
> Will file a PR with the changes as soon as the merge is done (the earlier merge unfortunately
> became outdated within 2 weeks :) ).
> Consolidated shuffle is broken in multiple ways in Spark:
> a) Task failure(s) can cause the state to become inconsistent.
> b) Multiple reverts, or a combination of close/revert/close, can cause the state to become
> inconsistent (as part of exception/error handling).
> c) Some of the APIs in the block writer cause implementation issues. For example, a revert
> is always followed by a close, but the implementation tries to keep them separate, creating
> surface area for errors.
> d) Fetching data from consolidated shuffle files can go badly wrong if the file is being
> actively written to: a segment's length is computed by subtracting its offset from the next
> offset (or from the file length, if it is the last segment). The latter fails when a fetch
> happens in parallel with a write.
> Note that this happens even when there are no task failures of any kind!
> This usually results in stream corruption or decompression errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
