spark-user mailing list archives

From Marius Soutier <mps....@gmail.com>
Subject Python vs Scala performance
Date Wed, 22 Oct 2014 10:00:41 GMT
Hi there,

we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON
data per day. I have written a couple of word count-like Scala jobs that essentially pull
in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes
to complete.

Now one of the data scientists on the team wants to write some jobs using Python. To learn
Spark, he rewrote one of my Scala jobs in Python. From the API side, everything looks more
or less identical. However, his jobs take between 5 and 8 hours to complete! We can also see that
the execution plan is quite different: I’m seeing writes to the output much later than in
Scala.
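
To give a rough idea of the shape of these jobs, here is a minimal sketch in plain Python
(no Spark involved, and not the actual job code). It reads gzip-compressed, newline-delimited
JSON and counts records per key, which is the "word count-like" pattern. The field names
`user` and `event`, the helper names and the sample records are all made up for illustration.

```python
import gzip
import io
import json
from collections import defaultdict


def read_gzip_json_lines(raw_bytes):
    """Yield text lines from gzip-compressed, newline-delimited JSON bytes."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line


def count_by_key(lines, key):
    """Parse each JSON line, group by `key`, and count records per group."""
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record[key]] += 1
    return dict(counts)


# Hypothetical sample records standing in for the real input.
records = [
    {"user": "a", "event": "click"},
    {"user": "b", "event": "view"},
    {"user": "a", "event": "view"},
]
sample = "\n".join(json.dumps(r) for r in records)
compressed = gzip.compress(sample.encode("utf-8"))

result = count_by_key(read_gzip_json_lines(compressed), "user")
print(result)
```

The real jobs do the same kind of thing across the cluster, with joins on top of the
grouping and aggregation.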

Is Python I/O really that slow?


Thanks
- Marius

