spark-user mailing list archives

From nileshc
Subject Python API Performance
Date Thu, 30 Jan 2014 16:30:01 GMT
Hi there,

I need to do some matrix multiplication inside the mappers, and I'm trying
to choose between Python and Scala for writing the Spark MR jobs. I'm
equally fluent in Python and Java, and find Scala pretty easy too, for what
it's worth. Going with Python would let me use numpy + scipy, which are
blazing fast compared to Java libraries like Colt. Configuring Java
with BLAS seems to be a pain compared to scipy (direct apt-get
installs, or pip).
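
As a rough sketch of the kind of mapper-side work in question (not actual job
code): the function name `multiply_block`, the (key, block) record layout, and
the sample data are all hypothetical, and plain Python `map` stands in for the
Spark call so the snippet runs on its own.

```python
import numpy as np

def multiply_block(record, weights):
    """Hypothetical mapper: multiply one (key, block) record by a shared matrix."""
    key, block = record
    # For float arrays, np.dot typically hands the work to whatever BLAS
    # library NumPy was built against, so no Python-level loop is involved.
    return key, np.dot(block, weights)

# Illustrative data only (assumed shapes: 1x2 blocks, 2x2 weights).
weights = np.array([[1.0, 0.0], [0.0, 2.0]])
records = [
    ("a", np.array([[1.0, 2.0]])),
    ("b", np.array([[3.0, 4.0]])),
]

# Plain map() stands in for the distributed map step.
results = dict(map(lambda r: multiply_block(r, weights), records))
```

In a PySpark job the same function could presumably be passed to `rdd.map`,
e.g. `rdd.map(lambda r: multiply_block(r, weights))`, with the heavy lifting
still done inside numpy.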

I posted a couple of comments on this answer at StackOverflow:
Basically it states that as of Spark 0.7.2, the Python API would be slower
than Scala. What's the performance situation now? The fork issue seems to be
fixed. How about serialization? Can it match the performance of Java/Scala
Writable-like serialization (which knows the object type beforehand, reducing
I/O)? Also, a probably silly question: loops seem to be slow in
Python in general. Do you think this could turn out to be an issue?

Bottom line: should I choose Python for computation-intensive algorithms like
PageRank? Scipy gives me an edge, but does the framework kill it?
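
To make the loop concern concrete, here is a minimal single-machine sketch of
PageRank as a plain-Python power iteration. It is only an illustration of the
loop-heavy style, not Spark code; the function name, the damping factor of
0.85, the iteration count, and the tiny link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Plain-Python power iteration; `links` maps each page to its outgoing links.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # dangling pages are simply skipped in this sketch
            share = damping * ranks[page] / len(outgoing)
            # This inner loop is the per-edge work that is slow in pure Python.
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical three-page graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

The nested loops touch every edge on every iteration, which is exactly the
part that a vectorized numpy/scipy formulation would move out of the
interpreter.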

Any help, insights, benchmarks will be much appreciated. :)

