spark-user mailing list archives

From Marius Soutier <mps....@gmail.com>
Subject Re: Python vs Scala performance
Date Wed, 22 Oct 2014 14:51:36 GMT
It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 GB RAM and 4
cores each, but fast enough for the current 40 GB a day. Data is on HDFS in EBS volumes. Each
file is a Gzip-compressed collection of JSON objects, each between 115 and 120 MB to stay
near the HDFS block size.

One core per worker is permanently used by a job that allows SQL queries over Parquet files.
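
Roughly, such a service boils down to registering the Parquet data and serving SQL against
it; a minimal sketch in PySpark (the path, table name and query below are made up for
illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="parquet-sql-service")   # hypothetical app name
    sqlContext = SQLContext(sc)

    # Register the Parquet data as a table that ad-hoc SQL can be run against.
    events = sqlContext.parquetFile("hdfs:///data/parquet/events")   # made-up path
    events.registerTempTable("events")

    # Example query; in the real job the SQL would come from whoever issues the queries.
    result = sqlContext.sql("SELECT COUNT(*) FROM events")
    print(result.collect())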

On 22.10.2014, at 16:18, Arian Pasquali <arian@arianpasquali.com> wrote:

> Interesting thread, Marius.
> Btw, I'm curious about your cluster size.
> How small is it in terms of RAM and cores?
> 
> Arian
> 
> 2014-10-22 13:17 GMT+01:00 Nicholas Chammas <nicholas.chammas@gmail.com>:
> Total guess without knowing anything about your code: does either of these two notes from
> the 1.1.0 release notes affect things at all?
> 
> - PySpark now performs external spilling during aggregations. The old behavior can be
>   restored by setting spark.shuffle.spill to false.
> - PySpark uses a new heuristic for determining the parallelism of shuffle operations. The
>   old behavior can be restored by setting spark.default.parallelism to the number of cores
>   in the cluster (see the sketch below).
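> 
> A rough example of restoring the old behavior for both (the parallelism value here is
> made up; use the number of cores in your cluster, e.g. 4 workers x 4 cores = 16):
> 
>     from pyspark import SparkConf, SparkContext
> 
>     conf = (SparkConf()
>             .setAppName("restore-1.0-behavior")       # hypothetical app name
>             .set("spark.shuffle.spill", "false")      # turn external spilling back off
>             .set("spark.default.parallelism", "16"))  # old default: cores in the cluster
>     sc = SparkContext(conf=conf)
> 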
> Nick
> 
> 
> On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps.dev@gmail.com> wrote:
> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but not that much...
> 
> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
> 
>> What version of Spark are you running? Some recent changes to how PySpark works relative
>> to Scala Spark may explain things.
>> 
>> PySpark should not be that much slower, not by a stretch.
>> 
>> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <ashic@live.com> wrote:
>> I'm no expert, but I looked into how the Python bits work a while back (I was trying
>> to assess what it would take to add F# support). It seems Python launches a JVM and talks
>> to "Scala Spark" in that JVM. The Python server bit "translates" the Python calls into
>> calls on the JVM; the Python SparkContext is like an adapter to the JVM SparkContext.
>> If you're seeing performance discrepancies, this might be the reason why. If the code can
>> be organised to require fewer interactions with the adapter, that may improve things. Take
>> this with a pinch of salt... I might be way off on this :)
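>> 
>> A tiny, hand-wavy illustration of that adapter idea (it only pokes at PySpark internals,
>> so treat it as descriptive, not as something to build on):
>> 
>>     from pyspark import SparkContext
>> 
>>     sc = SparkContext(appName="py4j-peek")   # hypothetical app name
>>     # The Python SparkContext forwards calls through a Py4J gateway to a JVM-side
>>     # JavaSparkContext; these internal attributes are the "adapter" plumbing.
>>     print(sc._gateway)   # Py4J gateway to the JVM hosting "Scala Spark"
>>     print(sc._jsc)       # the JavaSparkContext this Python context delegates to
>>     # Python functions in transformations are serialized and executed by Python worker
>>     # processes on the executors, which adds overhead compared to a pure Scala job.
>>     sc.stop()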
>> 
>> Cheers,
>> Ashic.
>> 
>> > From: mps.dev@gmail.com
>> > Subject: Python vs Scala performance
>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>> > To: user@spark.apache.org
>> 
>> > 
>> > Hi there,
>> > 
>> > we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed
>> > JSON data per day. I have written a couple of word-count-like Scala jobs that essentially
>> > pull in all the data, do some joins, group-bys and aggregations. A job takes around
>> > 40 minutes to complete.
>> > 
>> > Now one of the data scientists on the team wants to write some jobs using Python.
>> > To learn Spark, he rewrote one of my Scala jobs in Python. From the API side, everything
>> > looks more or less identical. However, his jobs take between 5 and 8 hours to complete!
>> > We can also see that the execution plan is quite different; I'm seeing writes to the
>> > output much later than in Scala.
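>> > 
>> > To make it concrete, the rough shape of such a job in PySpark (paths, field names and
>> > the aggregation are invented just for illustration):
>> > 
>> >     import json
>> >     from pyspark import SparkContext
>> > 
>> >     sc = SparkContext(appName="daily-aggregation")   # hypothetical app name
>> > 
>> >     # .gz files are decompressed transparently, but each one is a single split
>> >     events = sc.textFile("hdfs:///data/2014-10-22/*.gz").map(json.loads)
>> >     users  = sc.textFile("hdfs:///data/users.json").map(json.loads)
>> > 
>> >     # join on a key, then group and aggregate -- same shape as the Scala version
>> >     pairs  = events.map(lambda e: (e["userId"], 1))
>> >     joined = pairs.join(users.map(lambda u: (u["id"], u["country"])))
>> >     counts = joined.map(lambda kv: (kv[1][1], kv[1][0])).reduceByKey(lambda a, b: a + b)
>> >     counts.saveAsTextFile("hdfs:///output/by-country")   # hypothetical output path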
>> > 
>> > Is Python I/O really that slow?
>> > 
>> > 
>> > Thanks
>> > - Marius
>> > 
>> > 
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> > For additional commands, e-mail: user-help@spark.apache.org
>> > 
>> 
> 
> 
> 

