spark-issues mailing list archives

From "Ryan Williams (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
Date Fri, 17 Oct 2014 22:28:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175647#comment-14175647
] 

Ryan Williams commented on SPARK-3967:
--------------------------------------

I've been debugging this as well and I believe I've found a bug in {{org.apache.spark.util.Utils}}
that is contributing to (or causing) the problem:

{{Files.move}} on [line 390|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L390]
is called even when {{targetFile}} already exists and its contents are equal to {{tempFile}}'s.

The check on [line 379|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L379]
seems intended to skip a redundant overwrite when the file is already present with the expected
contents.

Gating the {{Files.move}} call on a further {{if (!targetFile.exists)}} fixes the issue for
me; attached is a patch of the change.
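To illustrate the proposed fix, here is a minimal Scala sketch of the guarded move. The method name {{moveIfAbsent}} and its shape are hypothetical, not the actual Spark code; it only mirrors the tempFile-to-targetFile step in {{Utils.fetchFile}} after an earlier check has established that an existing target's contents match the temp file:

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical helper sketching the patched behavior: only move the
// freshly-downloaded temp file into place when the target is missing.
// (In the real code path, an earlier contents-equality check has already
// established that an existing targetFile matches tempFile.)
def moveIfAbsent(tempFile: File, targetFile: File): Unit = {
  if (!targetFile.exists) {
    // Target missing: install the downloaded copy.
    Files.move(tempFile.toPath, targetFile.toPath)
  } else {
    // Target already present with identical contents: the move would be a
    // needless overwrite, so just discard the temp copy.
    tempFile.delete()
  }
}
```

With this guard, executors that find the dependency already in place effectively no-op instead of overwriting it.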

In practice, every executor of mine that hits this code path finds each dependency JAR already
present and byte-for-byte identical to what it needs. They were all needlessly overwriting all
of their dependency JARs, and with the patch they now essentially no-op in {{Utils.fetchFile}}.
I've not determined who/what is putting the JARs there, or why the issue only crops up in
{{yarn-cluster}} mode (i.e. {{--master yarn --deploy-mode cluster}}), but either way this
patch seems desirable.


> Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3967
>                 URL: https://issues.apache.org/jira/browse/SPARK-3967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Christophe PRÉAUD
>         Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
