It’s an AWS cluster that is rather small at the moment: 4 worker nodes with 28 GB RAM and 4 cores each, but fast enough for the current 40 GB a day. Data is on HDFS on EBS volumes. Each file is a Gzip-compressed collection of JSON objects, and each file is between 115 and 120 MB to stay near the HDFS block size.

One core per worker is permanently used by a job that allows SQL queries over Parquet files.
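
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch under stated assumptions: 1 GB is taken as 1024 MB, and each Gzip file is assumed to become exactly one input task, since Gzip files are not splittable. All names in the snippet are made up for illustration.

```python
import math

# Figures taken from the message above.
DAILY_VOLUME_GB = 40
FILE_SIZE_MB = (115, 120)      # smallest and largest file size
WORKERS = 4
CORES_PER_WORKER = 4
RESERVED_CORES_PER_WORKER = 1  # occupied by the SQL-over-Parquet job


def files_per_day(volume_gb, file_mb):
    """Number of files needed to hold volume_gb (assumes 1 GB = 1024 MB)."""
    return math.ceil(volume_gb * 1024 / file_mb)


# Larger files mean fewer files per day, and vice versa.
fewest = files_per_day(DAILY_VOLUME_GB, max(FILE_SIZE_MB))
most = files_per_day(DAILY_VOLUME_GB, min(FILE_SIZE_MB))

# Cores left over for batch jobs once the permanent job is accounted for.
free_cores = WORKERS * (CORES_PER_WORKER - RESERVED_CORES_PER_WORKER)

# Assumption: one non-splittable Gzip file = one input task, so the input
# stage runs in this many "waves" across the free cores (worst case).
waves = math.ceil(most / free_cores)

print(f"{fewest}-{most} files/day, {free_cores} free cores, ~{waves} task waves")
```

The estimate only covers the input stage; the number of shuffle partitions is a separate setting.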

On 22.10.2014, at 16:18, Arian Pasquali wrote:

Interesting thread Marius,
Btw, I'm curious about your cluster size.
How small is it in terms of RAM and cores?


2014-10-22 13:17 GMT+01:00 Nicholas Chammas:

Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0 release notes affect things at all?

  • PySpark now performs external spilling during aggregations. Old behavior can be restored by setting spark.shuffle.spill to false.
  • PySpark uses a new heuristic for determining the parallelism of shuffle operations. Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster.
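
To try both of these, the two properties can be passed on the command line. Below is a minimal sketch that builds such a `spark-submit` invocation. The helper names and the script name `job.py` are hypothetical, and the core count of 16 assumes the 4 workers with 4 cores each mentioned elsewhere in this thread (it does not subtract cores occupied by other jobs).

```python
import shlex

# Assumed cluster shape, taken from elsewhere in this thread.
WORKERS = 4
CORES_PER_WORKER = 4


def legacy_pyspark_conf(total_cores):
    """Settings that, per the 1.1.0 release notes, restore the old behavior."""
    return {
        "spark.shuffle.spill": "false",
        "spark.default.parallelism": str(total_cores),
    }


def spark_submit_command(script, conf):
    """Build a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return " ".join(shlex.quote(arg) for arg in args)


conf = legacy_pyspark_conf(WORKERS * CORES_PER_WORKER)
command = spark_submit_command("job.py", conf)  # job.py is a placeholder
print(command)
```

Running the job once with and once without these flags would show whether either change accounts for the difference.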


On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier wrote:
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that...

On 22.10.2014, at 13:02, Nicholas Chammas wrote:

What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things.

PySpark should not be that much slower, not by a stretch.

On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab wrote:
I'm no expert, but I looked into how the Python bits work a while back (I was trying to assess what it would take to add F# support). It seems Python hosts a JVM inside of it, and talks to "Scala Spark" in that JVM. The Python server bit "translates" the Python calls to those in the JVM. The Python Spark context is like an adapter to the JVM Spark context. If you're seeing performance discrepancies, this might be the reason why. If the code can be organised to require fewer interactions with the adapter, that may improve things. Take this with a pinch of salt... I might be way off on this :)


> From:
> Subject: Python vs Scala performance
> Date: Wed, 22 Oct 2014 12:00:41 +0200
> To:

> Hi there,
> we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to complete.
> Now one of the data scientists on the team wants to write some jobs using Python. To learn Spark, he rewrote one of my Scala jobs in Python. From the API side, everything looks more or less identical. However, his jobs take between 5 and 8 hours to complete! We can also see that the execution plan is quite different: I’m seeing writes to the output much later than in Scala.
> Is Python I/O really that slow?
> Thanks
> - Marius