spark-user mailing list archives

From Tom Vacek <>
Subject Re: Is There Any Benchmarks Comparing C++ MPI with Spark
Date Mon, 16 Jun 2014 23:07:20 GMT
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather.  There are also a few additional primitives, mostly
based on a join.  Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal.  Spark has one theoretical disadvantage
compared to MPI: every collective operation requires the task closures to
be distributed, and, to my knowledge, this is an O(p) operation, where p is
the number of processors.  (Perhaps there has been some progress on this?)
That O(p) term spoils any parallel isoefficiency analysis.  In MPI, binaries
are distributed once, and wireup is O(log p).  In practice, the O(p) term
prevents reasonable-looking strong scaling curves: with MPI, the overall runtime
will stop declining and level off with increasing p, but with Spark it can
go up sharply.  So, Spark is great for a small cluster.  For a huge
cluster, or a job with a lot of collectives, it isn't so great.
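
To make the strong-scaling argument concrete, here is a toy cost model. It
is only a sketch under loud assumptions, not a benchmark: the work is assumed
to divide perfectly across p processors, the MPI collective is assumed to
cost c * log2(p), and the Spark closure distribution is assumed to cost
c * p. The function names and the constants `work` and `c` are made up for
illustration and do not come from either system.

```python
import math


def mpi_time(work, p, c=1.0):
    """Toy model: perfectly divisible work plus an O(log p) collective."""
    return work / p + c * math.log2(p)


def spark_time(work, p, c=1.0):
    """Toy model: same work plus an assumed O(p) closure-distribution term."""
    return work / p + c * p


def scaling_table(work=10_000.0, procs=(1, 4, 16, 64, 256, 1024)):
    """Return (p, mpi_time, spark_time) for each processor count."""
    return [(p, mpi_time(work, p), spark_time(work, p)) for p in procs]


if __name__ == "__main__":
    for p, t_mpi, t_spark in scaling_table():
        print(f"p={p:5d}  mpi={t_mpi:10.2f}  spark={t_spark:10.2f}")
```

With these made-up constants, the modeled MPI runtime keeps flattening out
as p grows, while the modeled Spark runtime bottoms out near p = sqrt(work / c)
and then climbs, which is the shape of the curves described above.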

On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux <> wrote:

> I guess you have to understand the difference in architecture. I don't
> know much about C++ MPI, but it is basically MPI, whereas Spark is inspired
> by Hadoop MapReduce and optimised for reading and writing large amounts of
> data with a smart caching and locality strategy. Intuitively, if you have a
> high CPU/message ratio, then MPI might be better. But what that ratio is
> is hard to say, and in the end it will depend on your specific
> application. Finally, in real life, this difference in performance due to
> the architecture may not be the only or the most important factor in the
> choice, as Michael already explained.
> Bertrand
> On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler <> wrote:
>> Hello Wei,
>> I speak from experience writing many distributed HPC applications using
>> Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
>> Virtual Machine (PVM) way before that back in the 90's.  I can say with
>> absolute certainty:
>> *Any gains you believe there are because "C++ is faster than Java/Scala"
>> will be completely blown by the inordinate amount of time you spend
>> debugging your code and/or reinventing the wheel to do even basic tasks
>> like linear regression.*
>> There are undoubtedly some very specialised use-cases where MPI and its
>> brethren still dominate for High Performance Computing tasks -- for
>> example the nuclear decay simulations run by the US Department of Energy on
>> supercomputers where they've invested billions solving that use case.
>> Spark is part of the wider "Big Data" ecosystem, and its biggest
>> advantages are traction amongst internet scale companies, hundreds of
>> developers contributing to it and a community of thousands using it.
>> Need a distributed fault-tolerant file system? Use HDFS.  Need a
>> distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
>> between your worker processes? Use Zookeeper.  Need to run it on a flexible
>> grid of computing resources and handle failures? Run it on Mesos!
>> The barrier to entry to get going with Spark is very low: download the
>> latest distribution and start the Spark shell.  Language bindings for Scala
>> / Java / Python are excellent, meaning you spend less time writing
>> boilerplate code and more time solving problems.
>> Even if you believe you *need* to use native code to do something
>> specific, like fetching HD video frames from satellite video capture cards
>> -- wrap it in a small native library and use the Java Native Access
>> interface to call it from your Java/Scala code.
>> Have fun, and if you get stuck we're here to help!
>> MC
>> On 16 June 2014 08:17, Wei Da <> wrote:
>>> Hi guys,
>>> We are choosing between C++ MPI and Spark. Is there any official
>>> comparison between them? Thanks a lot!
>>> Wei
