spark-user mailing list archives

From Adam Roberts <AROBE...@uk.ibm.com>
Subject Re: Spark 2.0.0 - Java vs Scala performance difference
Date Thu, 01 Sep 2016 13:24:54 GMT
On Java vs Scala: Sean's right that behind the scenes you'll be calling 
JVM-based APIs anyway (e.g. sun.misc.Unsafe for Tungsten) and that the 
vast majority of Apache Spark's important logic is written in Scala.

It would be an interesting experiment to write the same functioning 
program using the Java APIs vs the Scala APIs, just to see if there is a 
noticeable difference. I'm thinking in terms of how the Scala 
implementation libraries perform at runtime: with profiling (we use 
Health Center, tprof, or just microbenchmarking with prints and timers), 
we've seen lots of code in Scala itself to do with (un)boxing and 
isInstanceOf checks that could do with some TLC for performance.
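
As a minimal sketch of the prints-and-timers style of measurement 
mentioned above (the class and method names are mine, purely 
illustrative, not from any Spark code), this compares a primitive long 
accumulator against a boxed Long one:

```java
// Illustrative prints-and-timers microbenchmark: boxed vs primitive
// accumulation. A real measurement would use a harness such as JMH,
// with warmup iterations and multiple forks.
public class BoxingBench {
    static final int N = 10_000_000;

    static long sumPrimitive() {
        long sum = 0;
        for (int i = 0; i < N; i++) {
            sum += i;          // stays in registers, no allocation
        }
        return sum;
    }

    static long sumBoxed() {
        Long sum = 0L;         // each += unboxes, adds, then re-boxes
        for (int i = 0; i < N; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long a = sumPrimitive();
        long t1 = System.nanoTime();
        long b = sumBoxed();
        long t2 = System.nanoTime();
        System.out.println("primitive: " + (t1 - t0) / 1_000_000 + " ms, sum=" + a);
        System.out.println("boxed:     " + (t2 - t1) / 1_000_000 + " ms, sum=" + b);
    }
}
```

Both loops compute the same sum; only the allocation and (un)boxing 
behaviour differs, which is exactly the kind of overhead that shows up 
under a profiler.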

This is now quite outdated, but it still shows that writing what's 
concise (Scala) isn't always best for performance: 
https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html

So if we stick to Java we may not hit those overheads as often (there's 
a talk by my colleague on boosting performance from a Java implementer's 
perspective at https://www.youtube.com/watch?v=rcVTM-71bZk), but I don't 
expect the differences to be enormous. Full disclosure: I work for IBM, 
and one of our goals is to make Apache Spark and our Java implementation 
perform well together.

There's also the obvious trade-off of developer productivity and code 
maintainability (there are more Java devs than Scala devs), so my 
suggestion is: whichever of Java or Scala you're much better at writing, 
use that for the genuinely performance-critical logic. Be aware that 
you're going to be hitting the Apache Spark codebase written in Scala 
anyway, so there's only so much to be gained here.

I also think that just-in-time (JIT) compiler implementations are 
generally better at optimising code written in Java than in Scala, as 
knowing the types well ahead of time, and where codepath shortcuts can 
be made in bytecode execution, should deliver a slight performance 
improvement. I am keen to come up with some solid, evidence-based 
recommendations for us all to benefit from.
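
To illustrate the kind of codepath shortcut meant here (a hypothetical 
sketch with made-up names, not Spark code): when element types are 
statically known, the compiled loop is straight-line arithmetic, whereas 
an erased, Object-based path pays a runtime type check, a cast, and an 
unbox per element:

```java
// Sketch: statically known types (long[]) vs an erased Object[] path.
// The typed loop needs no per-element checks; the erased loop does an
// instanceof test, a checkcast, and an unbox on every iteration.
public class TypeShortcuts {
    static long sumErased(Object[] values) {
        long sum = 0;
        for (Object v : values) {
            if (v instanceof Long) {   // runtime type check per element
                sum += (Long) v;       // checkcast + unbox
            }
        }
        return sum;
    }

    static long sumTyped(long[] values) {
        long sum = 0;
        for (long v : values) {        // no checks, no unboxing
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Object[] boxed = new Object[n];
        long[] primitive = new long[n];
        for (int i = 0; i < n; i++) {
            boxed[i] = (long) i;       // autoboxed to Long
            primitive[i] = i;
        }
        long erased = sumErased(boxed);
        long typed = sumTyped(primitive);
        System.out.println("sums equal: " + (erased == typed));
    }
}
```

The two paths are semantically identical; the difference is purely in 
how much work the JIT can strip out when the types are known up front.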




From:   Aseem Bansal <asmbansal2@gmail.com>
To:     ayan guha <guha.ayan@gmail.com>
Cc:     Sean Owen <sowen@cloudera.com>, user <user@spark.apache.org>
Date:   01/09/2016 13:11
Subject:        Re: Spark 2.0.0 - Java vs Scala performance difference



There is already a mail thread for Scala vs Python; check the archives.

On Thu, Sep 1, 2016 at 5:18 PM, ayan guha <guha.ayan@gmail.com> wrote:
How about Scala vs Python?

On Thu, Sep 1, 2016 at 7:27 PM, Sean Owen <sowen@cloudera.com> wrote:
I can't think of a situation where it would be materially different.
Both are using the JVM-based APIs directly. Here and there there's a
tiny bit of overhead in using the Java APIs because something is
translated from a Java-style object to a Scala-style object, but this
is generally trivial.

On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal <asmbansal2@gmail.com> wrote:
> Hi
>
> Would there be any significant performance difference when using Java vs.
> Scala API?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org




-- 
Best Regards,
Ayan Guha


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
