spark-user mailing list archives

From Matei Zaharia <>
Subject Re: Compare with Storm
Date Mon, 28 Oct 2013 01:17:23 GMT
Hey Howard,

Great to hear that you're looking at Spark Streaming!

> We have some in-house real-time streaming jobs written for Storm and want to see the
possibility of migrating to Spark Streaming in the future, as our team all think Spark is a very
promising technology (one platform to execute both realtime & interactive jobs) with
excellent documentation.
> 1. If we focus on the streaming capabilities, what are the main pros/cons at the
moment? Is Spark Streaming suitable for production use now?

Spark Streaming provides all the facilities for continuous operation, but master fault
tolerance currently takes a bit more manual setup (in particular, you have to manually restart
your app from a checkpoint if the master crashes). We plan to improve this later. Several
groups are using it in production as far as I know, though, so it's worth a try as long as
you're aware of this.
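The recover-from-checkpoint pattern described above can be sketched in plain Scala; the names here (JobState, getOrCreate, etc.) are illustrative stand-ins, not Spark API, though Spark Streaming exposes an analogous entry point for resuming from a checkpoint directory.

```scala
// Hypothetical sketch: if a saved checkpoint exists, resume from it;
// otherwise build fresh state. This models the manual master-restart
// setup, not the actual Spark Streaming classes.
case class JobState(batchesProcessed: Int, counts: Map[String, Long])

def getOrCreate(checkpoint: Option[JobState], create: () => JobState): JobState =
  checkpoint match {
    case Some(saved) => saved    // master restarted: resume where we left off
    case None        => create() // first launch: start from scratch
  }

val fresh   = getOrCreate(None, () => JobState(0, Map.empty))
val resumed = getOrCreate(Some(JobState(42, Map("a" -> 7L))), () => JobState(0, Map.empty))
```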

> 2. In terms of message reliability and transaction support, I assume both need to rely
on ZooKeeper, right?

As far as I know neither of them uses ZooKeeper for message reliability -- they implement
this within the application, and maybe use ZooKeeper for leader election. Spark Streaming
is designed to compute reliably (everything has "exactly-once" semantics by default) and does
this using a mechanism called "discretized streams".
Storm only provides "at-least-once" delivery by default and does not provide fault tolerance
for state, though Trident (built on top) gives you that and exactly-once semantics at the
cost of using a database to do transactions. Storm also uses ZooKeeper for automatic leader
election while Spark requires the manual setup of master restart as described above.
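The exactly-once guarantee above rests on the discretized-stream idea: the input is cut into small immutable batches, and each batch's result is a pure function of its input, so a lost result can be recomputed with an identical outcome. A toy model in plain Scala (processBatch and the batch layout are illustrative assumptions, not Spark API):

```scala
// Toy model of discretized streams: per-batch word counts computed by a
// deterministic function. If a batch's result is lost, recomputing it from
// the replayable source yields the same answer, so output is exactly-once.
def processBatch(batch: Seq[String]): Map[String, Int] =
  batch.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

val batches    = Seq(Seq("a", "b", "a"), Seq("b", "c"))
val firstRun   = batches.map(processBatch)
val recomputed = processBatch(batches(0)) // "lost" batch 0 redone from source
```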

> 3. In Storm, we use Topology/Spout/Bolt as the data model; how do we translate them
to Spark Streaming if we want to rewrite our system? Is there a migration guide?

This is probably the biggest difference -- Spark Streaming only works in terms of high-level
operations, such as map() and groupBy(), and doesn't expose a lower-level "nodes exchanging
messages" model. You can probably take the code you use within a bolt and use it as a map
function or whatever the right operation is in Spark or the way you want to combine data.
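As a rough illustration of that translation, the per-tuple logic that lived inside a bolt's execute() becomes the function passed to map(), and field grouping between bolts becomes groupBy(). Plain Scala collections stand in for a DStream here, and the Event/enrich names are hypothetical; only the operation names (map, groupBy) match Spark's high-level API.

```scala
// Sketch of carrying bolt logic over to high-level operations.
case class Event(user: String, clicks: Int)

// Formerly the body of a bolt's execute(): parse one raw tuple.
def enrich(raw: String): Event = {
  val Array(user, n) = raw.split(",")
  Event(user, n.toInt)
}

val raw = Seq("alice,3", "bob,1", "alice,2")
val perUser = raw
  .map(enrich)                                      // map() replaces the bolt
  .groupBy(_.user)                                  // groupBy() replaces field grouping
  .map { case (u, es) => (u, es.map(_.clicks).sum) }
```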

> 4. Can Spark do distributed RPC like Storm?

Since Spark Streaming is just running on top of Spark, you can actually run code to grab the
state of the computation at any time as an RDD and then run Spark queries on it. So while
it's not exactly the same as Storm's distributed RPC, it is a way to run arbitrary parallel
computations on your data, with a high-level API.
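The query-the-state idea can be modeled in plain Scala: fold the per-batch results into a running state, then run an arbitrary query over a snapshot of it, roughly the way one would run a Spark job over an RDD of the current state. The batch data and the "top key" query are made-up examples.

```scala
// Illustrative model of querying streaming state on demand.
val batchResults = Seq(Map("a" -> 2, "b" -> 1), Map("a" -> 1, "c" -> 4))

// Merge batches into a running state (sum counts per key).
val state: Map[String, Int] =
  batchResults.foldLeft(Map.empty[String, Int]) { (acc, batch) =>
    batch.foldLeft(acc) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v) }
  }

// Ad-hoc "query" over the snapshot, e.g. the most frequent key.
val top = state.maxBy(_._2)
```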
