spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-26712) Single disk broken causing YarnShuffleSerivce not available
Date Tue, 26 Feb 2019 15:41:01 GMT


Sean Owen resolved SPARK-26712.
    Resolution: Won't Fix

> Single disk broken causing YarnShuffleSerivce not available
> -----------------------------------------------------------
>                 Key: SPARK-26712
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is enabled. However,
the recovery file lives under a single directory, which may be unavailable if its disk is broken. So
if an NM restart happens (caused by a kill or some other reason), the shuffle service cannot
start, even if there are executors on the node.
> This may eventually cause job failures (if the node or the executors on it are not blacklisted), or, at
the least, wasted resources (shuffle fetches from this node always fail).
> For long-running Spark applications, this problem may be more serious.
> So I think we should support multiple directories (multiple disks) for this recovery, and switch
to a good directory when the disk of the current directory is broken.
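
The fallback proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not Spark or YARN code: `is_usable` and `pick_recovery_dir` are hypothetical names, the health check is assumed to be a simple write probe, and migrating already-saved recovery state to the new directory is left out.

```python
import os
import tempfile
from typing import Iterable, Optional


def is_usable(directory: str) -> bool:
    """Return True if a probe file can be created and removed in `directory`.

    Assumption: a failed write probe is treated as "disk broken".
    """
    try:
        os.makedirs(directory, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # the probe file is deleted when the context exits
        return True
    except OSError:
        return False


def pick_recovery_dir(candidates: Iterable[str],
                      current: Optional[str] = None) -> Optional[str]:
    """Keep `current` while it is healthy, else return the first usable candidate.

    Returns None when no candidate directory passes the probe.
    """
    if current is not None and is_usable(current):
        return current
    for directory in candidates:
        if is_usable(directory):
            return directory
    return None
```

With one candidate directory per disk, a restart would call `pick_recovery_dir` before opening the recovery file, so a single failed disk would no longer prevent the service from starting. State written to the failed disk would still be lost unless it were also replicated, which this sketch does not attempt.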

This message was sent by Atlassian JIRA
