spark-user mailing list archives

From Marius Soutier <mps....@gmail.com>
Subject Re: Python vs Scala performance
Date Wed, 22 Oct 2014 11:29:20 GMT
We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but not that...

On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> What version of Spark are you running? Some recent changes to how PySpark works relative
to Scala Spark may explain things.
> 
> PySpark should not be that much slower, not by a stretch.
> 
> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <ashic@live.com> wrote:
> I'm no expert, but I looked into how the Python bits work a while back (I was trying to assess
what it would take to add F# support). It seems Python hosts a JVM inside of it, and talks
to "Scala Spark" in that JVM. The Python server bit "translates" the Python calls to those
in the JVM. The Python Spark context is like an adapter to the JVM Spark context. If you're
seeing performance discrepancies, this might be the reason why. If the code can be organised
to require fewer interactions with the adapter, that may improve things. Take this with a
pinch of salt... I might be way off on this :)
> 
> Cheers,
> Ashic.
> 
> > From: mps.dev@gmail.com
> > Subject: Python vs Scala performance
> > Date: Wed, 22 Oct 2014 12:00:41 +0200
> > To: user@spark.apache.org
> 
> > 
> > Hi there,
> > 
> > we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed
JSON data per day. I have written a couple of word count-like Scala jobs that essentially
pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes
to complete.
> > 
> > Now one of the data scientists on the team wants to write some jobs using Python.
To learn Spark, he rewrote one of my Scala jobs in Python. From the API side, everything looks
more or less identical. However, his jobs take between 5 and 8 hours to complete! We can also see
that the execution plan is quite different: I’m seeing writes to the output much later than
in Scala.
> > 
> > Is Python I/O really that slow?
> > 
> > 
> > Thanks
> > - Marius
> > 
> > 
> > 
> 
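Ashic's suggestion in the thread above, organising the code so that it needs fewer interactions with the adapter, can be illustrated with a toy model. This is only a sketch in plain Python and does not use or reproduce PySpark's actual mechanics: `CountingBridge`, `per_record` and `batched` are hypothetical names invented here, and the "bridge" simply counts how often a boundary is crossed for the same word count.

```python
from collections import Counter


class CountingBridge:
    """Hypothetical stand-in for an adapter between two runtimes.

    It does no real inter-process work; it only counts round trips.
    """

    def __init__(self):
        self.round_trips = 0

    def call(self, func, payload):
        # Every call models one crossing of the adapter boundary.
        self.round_trips += 1
        return func(payload)


def count_words(lines):
    """Count whitespace-separated words across a list of lines."""
    return Counter(word for line in lines for word in line.split())


def per_record(bridge, lines):
    """Chatty variant: one boundary crossing per input line."""
    total = Counter()
    for line in lines:
        total += bridge.call(count_words, [line])
    return total


def batched(bridge, lines):
    """Batched variant: one boundary crossing for the whole input."""
    return bridge.call(count_words, list(lines))


lines = ["a b a", "b c", "a"]
chatty, bulk = CountingBridge(), CountingBridge()

# Both variants compute the same counts; only the number of crossings differs.
assert per_record(chatty, lines) == batched(bulk, lines)
print(chatty.round_trips, bulk.round_trips)  # prints "3 1"
```

Whether boundary crossings of this kind explain the 40 minutes versus 5 to 8 hours reported in this thread is not established by the messages above; the sketch only shows what "fewer interactions" would mean in code.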

