spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From pratik gawande <>
Subject Fw: Significant performance difference for same spark job in scala vs pyspark
Date Fri, 06 May 2016 04:47:18 GMT

I am new to spark. For one of  job I am finding significant performance difference when run
in pyspark vs scala. Could you please let me know if this is known and scala is preferred
over python for writing spark jobs? Also DAG visualization shows completely different DAGs
for scala and pyspark. I have pasted DAG for both using toDebugString() method. Let me know
if you need any additional information.

Time for Job in scala : 52 secs

Time for job in pyspark : 4.2 min

Scala code in Zepplin:

val lines = sc.textFile("s3://[test-bucket]/output2/")
val words = lines.flatMap(line => line.split(" "))
val filteredWords = words.filter(word => word.equals("Gutenberg") || word.equals("flower")
|| word.equals("a"))
val wordMap = => (word, 1)).reduceByKey(_ + _)

pyspark code in Zepplin:

lines = sc.textFile("s3://[test-bucket]/output2/")
words = lines.flatMap(lambda x: x.split())
filteredWords = words.filter(lambda x: (x == "Gutenberg" or x == "flower" or x == "a"))
result = x: (x, 1)).reduceByKey(lambda a,b: a+b).collect()
print result

Scala final RDD:

print wordMap.toDebugString()

 lines: org.apache.spark.rdd.RDD[String] = s3://[test-bucket]/output2/ MapPartitionsRDD[108]
at textFile at <console>:30 words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[109]
at flatMap at <console>:31 filteredWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[110]
at filter at <console>:33 wordMap: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[112]
at reduceByKey at <console>:35 (10) ShuffledRDD[112] at reduceByKey at <console>:35
[] +-(10) MapPartitionsRDD[111] at map at <console>:35 [] | MapPartitionsRDD[110] at
filter at <console>:33 [] | MapPartitionsRDD[109] at flatMap at <console>:31 []
| s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30 [] |
s3://[test-bucket]/output2/ HadoopRDD[107] at textFile at <console>:30 []

PySpark final RDD:


(10) PythonRDD[119] at RDD at PythonRDD.scala:43 [] | s3://[test-bucket]/output2/ MapPartitionsRDD[114]
at textFile at null:-1 [] | s3://[test-bucket]/output2/HadoopRDD[113] at textFile at null:-1
[] PythonRDD[120] at RDD at PythonRDD.scala:43



View raw message