spark-issues mailing list archives

From "Amo A (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments
Date Fri, 23 Jan 2015 22:42:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290153#comment-14290153 ]

Amo A edited comment on SPARK-5209 at 1/23/15 10:42 PM:
--------------------------------------------------------

So after reproducing this with the given steps, we found that the job runs without any issue
when "spark.akka.heartbeat.pauses" is set to 6000 (as recommended in http://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior)
while all the other settings in your conf file remain the same.

After doing a bit of reading on Akka actor behavior and the impact of these settings (with
the caveat that my understanding of how Akka works is limited), the relevant values are:

spark.akka.heartbeat.pauses = 6000 (used to be 600 in Spark 1.1.1)
spark.akka.failure-detector.threshold = 300
spark.akka.heartbeat.interval = 1000

My guess is that the time between two heartbeats for a particular actor (spark.akka.heartbeat.interval)
has to be smaller than the total of the acceptable pause (due to GC or higher load, i.e.
spark.akka.heartbeat.pauses) plus the padding before the failure detector (spark.akka.failure-detector.threshold)
activates and triggers a kill.
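
For reference, a minimal sketch of how these overrides could be applied programmatically from
PySpark (they can equally go in spark-defaults.conf); the property names and values are the
ones listed above:

{code}
from pyspark import SparkConf, SparkContext

# Sketch only: the same overrides can also live in spark-defaults.conf.
# Values are the ones discussed in this thread for Spark 1.2.0.
conf = (SparkConf()
        .setAppName("heartbeat-settings-sketch")
        .set("spark.akka.heartbeat.pauses", "6000")
        .set("spark.akka.failure-detector.threshold", "300")
        .set("spark.akka.heartbeat.interval", "1000"))

# Submit via spark-submit so the master is picked up from the cluster config.
sc = SparkContext(conf=conf)
{code}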

Looking at the Spark 1.1.1 docs, the default appears to be 600, whereas the 1.2.0 docs (linked
above) suggest a default of 6000. If my theory/understanding above is correct, I wonder how
this worked in Spark 1.1.x. Could someone help explain this?

Thank you.


was (Author: amodha):
So after some further testing as promised, we found that the job runs without any issue
when "spark.akka.heartbeat.pauses" is set to 6000 (as recommended in http://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior)
while all the other settings in your conf file remain the same.

After doing a bit of reading on Akka actor behavior and the impact of these settings (with
the caveat that my understanding of how Akka works is limited), the relevant values are:

spark.akka.heartbeat.pauses = 6000 (used to be 600 in Spark 1.1.1)
spark.akka.failure-detector.threshold = 300
spark.akka.heartbeat.interval = 1000

My guess is that the time between two heartbeats for a particular actor (spark.akka.heartbeat.interval)
has to be smaller than the total of the acceptable pause (due to GC or higher load, i.e.
spark.akka.heartbeat.pauses) plus the padding before the failure detector (spark.akka.failure-detector.threshold)
activates and triggers a kill.

Looking at the Spark 1.1.1 docs, the default appears to be 600, whereas the 1.2.0 docs (linked
above) suggest a default of 6000. If my theory/understanding above is correct, I wonder how
this worked in Spark 1.1.x. Could someone help explain this?

Thank you.

> Jobs fail with "unexpected value" exception in certain environments
> -------------------------------------------------------------------
>
>                 Key: SPARK-5209
>                 URL: https://issues.apache.org/jira/browse/SPARK-5209
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Amazon Elastic Map Reduce
>            Reporter: Sven Krasser
>         Attachments: driver_log.txt, exec_log.txt, gen_test_data.py, repro.py, spark-defaults.conf
>
>
> Jobs fail consistently and reproducibly with exceptions of the following type in PySpark using Spark 1.2.0:
> {noformat}
> 2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] executor.Executor (Logging.scala:logError(96)) - Exception in task 27.0 in stage 0.0 (TID 28)
> org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@4c09f3e0)
> {noformat}
> The issue first appeared in Spark 1.2.0 and is sensitive to the environment (configuration, cluster size), i.e. some changes to the environment cause the error not to occur.
> The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch an EMR cluster with the following parameters (this will bootstrap Spark 1.2.0 onto it):
> {code}
> aws emr create-cluster --region us-west-1 --no-auto-terminate \
>    --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
>    --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]' \
>    --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
>    InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name "Spark Issue Repro" \
>    --visible-to-all-users --applications Name=Ganglia
> {code}
> Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}.
> Run {{~/spark/bin/spark-submit gen_test_data.py}} to generate a test data set on HDFS. Then run {{~/spark/bin/spark-submit repro.py}} to reproduce the error.
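> For illustration only, a hypothetical minimal PySpark job of the same general shape as the attached {{repro.py}} (the actual script may differ): it reads the generated data and forces a key-based shuffle, which goes through PySpark's PairwiseRDD, where the exception above is raised.
> {code}
> from pyspark import SparkContext
>
> # Hypothetical sketch only -- not the attached repro.py.
> sc = SparkContext(appName="spark-5209-illustration")
>
> # Assumed HDFS path for the data written by gen_test_data.py.
> lines = sc.textFile("/test_data")
>
> # Any key-based shuffle (reduceByKey, groupByKey, partitionBy) exercises
> # PairwiseRDD on the JVM side, where the "unexpected value" error surfaces.
> pairs = lines.map(lambda line: (hash(line) % 1000, line))
> print(pairs.reduceByKey(lambda a, b: a).count())
>
> sc.stop()
> {code}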
> Driver and executor logs are attached. For reference, a spark-user thread on the topic is here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CC5A80834-8F1C-4C0A-89F9-E04D3F1C4469@gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

