spark-user mailing list archives

From: Karlson <ksonsp...@siberie.de>
Subject: Spark stages very slow to complete
Date: Mon, 01 Jun 2015 14:52:50 GMT
Hi,

In all my (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete, and sometimes never do. This clearly correlates with the
size of my input data. Looking at the stage details for one such stage,
I am wondering where Spark spends all this time. Take this table of the
stage's task metrics, for example:

Metric                     Min          25th percentile  Median       75th percentile  Max
Duration                   1.4 min      1.5 min          1.7 min      1.9 min          2.3 min
Scheduler Delay            1 ms         3 ms             4 ms         5 ms             23 ms
Task Deserialization Time  1 ms         2 ms             3 ms         8 ms             22 ms
GC Time                    0 ms         0 ms             0 ms         0 ms             0 ms
Result Serialization Time  0 ms         0 ms             0 ms         0 ms             1 ms
Getting Result Time        0 ms         0 ms             0 ms         0 ms             0 ms
Input Size / Records       23.9 KB / 1  24.0 KB / 1      24.1 KB / 1  24.1 KB / 1      24.3 KB / 1

Why is the overall duration almost 2 minutes? Where is all this time
spent, when no progress on the stage is visible? The progress bar
simply shows 0 succeeded tasks for a very long time before it
sometimes slowly starts to progress.
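
To at least see where the wall-clock time goes inside a task, I have
started timing the per-partition work myself, roughly like this (a
minimal sketch assuming a plain RDD job; `my_actual_work` and the
input path are placeholders for my real logic):

    import time

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-timing")

    def my_actual_work(records):
        # Placeholder for the real per-record logic of my job.
        return [r.upper() for r in records]

    def timed_partition(index, iterator):
        # Time the whole partition, since the UI's metrics (scheduler
        # delay, deserialization, GC, result serialization) do not seem
        # to cover the time spent in the Python worker itself.
        start = time.time()
        result = my_actual_work(iterator)
        print("partition %d: %d records in %.1f s"
              % (index, len(result), time.time() - start))
        return iter(result)

    rdd = sc.textFile("hdfs:///some/input")  # placeholder input path
    rdd.mapPartitionsWithIndex(timed_partition).count()

The per-partition timings then end up in the executor logs, which I
can reach through the Executors tab of the web UI.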

Also, the name of the stage displayed above is `javaToPython at
null:-1`, which I find very uninformative; I don't even know which
action is responsible for this stage. Has anyone experienced
similar issues, or does anyone have any advice for me?
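
The only partial workaround I have found for the naming so far is to
tag jobs and RDDs myself, roughly like this (a sketch; the group id,
description, RDD names and input path are all made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-naming")

    # Group and describe the jobs triggered below, so they are easier
    # to find in the Jobs tab of the web UI.
    sc.setJobGroup("daily-counts", "aggregating daily counts")

    # setName() labels the RDD itself (the name shows up e.g. in the
    # Storage tab once the RDD is cached).
    rdd = sc.textFile("hdfs:///some/input").setName("raw input")
    counts = rdd.map(lambda line: (line, 1)) \
                .reduceByKey(lambda a, b: a + b) \
                .setName("counts")
    counts.count()

This only labels the jobs and RDDs in the UI, though; the
`javaToPython at null:-1` call site itself stays as it is.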

Thanks!


