spark-user mailing list archives

From Srikanth <srikanth...@gmail.com>
Subject "Job duration" and "Processing time" don't match
Date Thu, 08 Sep 2016 19:31:44 GMT
Hello,

I was looking at Spark streaming UI and noticed a big difference between
"Processing time" and "Job duration"

[image: Inline image 1]

Processing time / Output Op duration is shown as 50s, but the sum of all job durations is ~25s.
What is causing this difference? Based on the logs, I know the batch actually took 50s.

[image: Inline image 2]

The job that is taking most of the time is:
    joinRDD.toDS()
      .write.format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .options(Map("mode" -> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "false"))
      .partitionBy("entityId", "regionId", "eventDate")
      .save(outputPath)

Removing SaveMode.Append speeds things up considerably, and the mismatch
between job duration and processing time disappears.
I'm not able to explain what is causing this, though.
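
For reference, the faster variant simply drops the append mode. A minimal sketch (joinRDD and outputPath as in the snippet above; this assumes the default ErrorIfExists behavior, or an explicit Overwrite, is acceptable for the output path, and that a SparkSession is already in scope):

    // Same write as above, but without SaveMode.Append.
    // Assumes joinRDD and outputPath are defined as in the original snippet,
    // and that failing on (or overwriting) an existing outputPath is acceptable.
    joinRDD.toDS()
      .write.format("com.databricks.spark.csv")
      .options(Map("mode" -> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "false"))
      .partitionBy("entityId", "regionId", "eventDate")
      .save(outputPath)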

Srikanth
