spark-issues mailing list archives

From "Amo A (JIRA)" <>
Subject [jira] [Commented] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments
Date Fri, 23 Jan 2015 22:42:35 GMT


Amo A commented on SPARK-5209:

So after some further testing as promised, we found that the job runs without any
issue when "spark.akka.heartbeat.pauses" is set to 6000 (as recommended in the 1.2.0
documentation), while all the other settings in the conf file remain the same.

After doing some reading on Akka actor behavior and the impact of these settings
(assuming my limited understanding of how Akka works is correct), the relevant values are:

spark.akka.heartbeat.pauses = 6000 ( used to be 600 in Spark 1.1.1)
spark.akka.failure-detector.threshold = 300
spark.akka.heartbeat.interval = 1000
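
For reference, these would go into {{spark-defaults.conf}} along these lines (a sketch only; the property names are as quoted above, and the attached conf file may of course differ):

```
spark.akka.heartbeat.pauses              6000
spark.akka.failure-detector.threshold    300
spark.akka.heartbeat.interval            1000
```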

My guess is that the time between two heartbeats from a particular actor (spark.akka.heartbeat.interval)
has to be smaller than the tolerated pause (allowing for GC or higher load) plus the padding
before the failure detector (spark.akka.failure-detector.threshold) activates and triggers a kill.
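
If that reading is right, the relation reduces to simple arithmetic. The sketch below is a hypothetical simplification for illustration only (Akka's real failure detector is a phi-accrual detector, and `healthy` is my own name, not an Akka or Spark API):

```python
# Hypothetical simplification of the timing relation described above.
# Values come straight from the config settings quoted in this comment.
def healthy(interval, pauses, threshold):
    # Per the theory above: the gap between two heartbeats must stay
    # below the tolerated pause plus the detector's padding, or the
    # failure detector declares the actor dead and triggers a kill.
    return interval < pauses + threshold

print(healthy(1000, 6000, 300))  # 1.2.0 recommended pauses: True
print(healthy(1000, 600, 300))   # old 1.1.1 pauses default: False
```

With the old 1.1.1 default the inequality already fails for a healthy actor, which is exactly why I wonder how this worked in 1.1.x.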

Looking at the Spark 1.1.1 docs, the default seems to have been 600; however, the 1.2.0
docs (as noted above) suggest a default of 6000. If my theory/understanding
is correct, I wonder how this worked in Spark 1.1.x. Would someone be able to help explain this?

Thank you.

> Jobs fail with "unexpected value" exception in certain environments
> -------------------------------------------------------------------
>                 Key: SPARK-5209
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Amazon Elastic Map Reduce
>            Reporter: Sven Krasser
>         Attachments: driver_log.txt, exec_log.txt, spark-defaults.conf
> Jobs fail consistently and reproducibly with exceptions of the following type in PySpark
using Spark 1.2.0:
> {noformat}
> 2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] executor.Executor (Logging.scala:logError(96))
- Exception in task 27.0 in stage 0.0 (TID 28)
> org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@4c09f3e0)
> {noformat}
> The issue appeared the first time in Spark 1.2.0 and is sensitive to the environment
(configuration, cluster size), i.e. some changes to the environment will cause the error to
not occur.
> The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch an EMR
cluster with the following parameters (this will bootstrap Spark 1.2.0 onto it):
> {code}
> aws emr create-cluster --region us-west-1 --no-auto-terminate \
>    --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
>    --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]' \
>    --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
>    InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name "Spark Issue Repro" \
>    --visible-to-all-users --applications Name=Ganglia
> {code}
> Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}.
> Run {{~/spark/bin/spark-submit}} to generate a test data set on HDFS.
Then lastly run {{~/spark/bin/spark-submit}} to reproduce the error.
> Driver and executor logs are attached. For reference, a spark-user thread on the topic
is here:

This message was sent by Atlassian JIRA
