spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Evo Eftimov" <>
Subject RE: Spark or Storm
Date Wed, 17 Jun 2015 18:53:17 GMT
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and "redundant"
shipping of the essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again 


This is a legacy from Spark Batch where such approach DOES make sense 


So to recap - in Spark Streaming, the driver keeps serializing and
transmitting the same Tasks (comprising a Job) for every new DStream RDD,
which then get re-launched ie new JVM Threads launched within each Executor
(JVM) and then the tasks report their final execution status to the driver
(only the last has real functional purpose)


An optimization (provided Spark Streaming was implemented from scratch)
could be to launch the Stages/Tasks of a Streaming Job as constantly running
Threads (Demons/Agents) within the Executors and leave the DStream RDD
"stream" through them as only the final execution status for each DSTream
RDD and some periodical heartbeats (of the Agents) are reported to the


Essentially this gives you are Pipeline Architecture (of stringed Agents)
which is a well known Parallel Programming Patterns especially suitable for
streaming data 


From: Matei Zaharia [] 
Sent: Wednesday, June 17, 2015 7:14 PM
To: Enno Shioji
Cc: Ashish Soni; ayan guha; Sabarish Sasidharan; Spark Enthusiast; Will
Briggs; user; Sateesh Kavuri
Subject: Re: Spark or Storm


This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use
reduceByKeyAndWindow to keep track of a running count) is exactly-once. When
you write to a storage system, no matter which streaming framework you use,
you'll have to make sure the writes are idempotent, because the storage
system can't know whether you meant to write the same data again or not. But
the place where Spark Streaming helps over Storm, etc is for tracking state
within your computation. Without that facility, you'd not only have to make
sure that writes are idempotent, but you'd have to make sure that updates to
your own internal state (e.g. reduceByKeyAndWindow) are exactly-once too.




On Jun 17, 2015, at 8:26 AM, Enno Shioji <> wrote:


The thing is, even with that improvement, you still have to make updates
idempotent or transactional yourself. If you read


that refers to the latest version, it says:


Semantics of output operations

Output operations (like foreachRDD) have at-least once semantics, that is,
the transformed data may get written to an external entity more than once in
the event of a worker failure. While this is acceptable for saving to file
systems using the saveAs***Files operations (as the file will simply get
overwritten with the same data), additional effort may be necessary to
achieve exactly-once semantics. There are two approaches.

.         Idempotent updates: Multiple attempts always write the same data.
For example, saveAs***Files always writes the same data to the generated

.         Transactional updates: All updates are made transactionally so
that updates are made exactly once atomically. One way to do this would be
the following.

o   Use the batch time (available in foreachRDD) and the partition index of
the transformed RDD to create an identifier. This identifier uniquely
identifies a blob data in the streaming application.

o   Update external system with this blob transactionally (that is, exactly
once, atomically) using the identifier. That is, if the identifier is not
already committed, commit the partition data and the identifier atomically.
Else if this was already committed, skip the update.


So either you make the update idempotent, or you have to make it
transactional yourself, and the suggested mechanism is very similar to what
Storm does.





On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni <> wrote:


As per the latest version and documentation Spark Streaming does offer
exactly once semantics using improved kafka integration , Not i have not
tested yet.


Any feedback will be helpful if anyone is tried the same.




On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji <> wrote:

AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus
additionally, elastic scaling unlike Storm), Kinesis providing the
coordination. My understanding is that it's like a naked Storm worker
process that can consequently only do map.


I haven't really used it tho, so can't really comment how it compares to
Spark/Storm. Maybe somebody else will be able to comment.




On Wed, Jun 17, 2015 at 3:13 PM, ayan guha <> wrote:

Thanks for this. It's kcl based kinesis application. But because its just a
Java application we are thinking to use spark on EMR or storm for fault
tolerance and load balancing. Is it a correct approach?

On 17 Jun 2015 23:07, "Enno Shioji" <> wrote:

Hi Ayan,


Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
should be able to use their "processor" interface for that. In this example,
it's incrementing a counter:


Instead of incrementing a counter, you could do your transformation and send
it to HBase.







On Wed, Jun 17, 2015 at 1:40 PM, ayan guha <> wrote:

Great discussion!!

One qs about some comment: Also, you can do some processing with Kinesis. If
all you need to do is straight forward transformation and you are reading
from Kinesis to begin with, it might be an easier option to just do the
transformation in Kinesis

- Do you mean KCL application? Or some kind of processing withinKineis? 

Can you kindly share a link? I would definitely pursue this route as our
transformations are really simple.



On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni <> wrote:

My Use case is below 


We are going to receive lot of event as stream ( basically Kafka Stream )
and then we need to process and compute 


Consider you have a phone contract with ATT and every call / sms / data
useage you do is an event and then it needs  to calculate your bill on real
time basis so when you login to your account you can see all those variable
as how much you used and how much is left and what is your bill till date
,Also there are different rules which need to be considered when you
calculate the total bill one simple rule will be 0-500 min it is free but
above it is $1 a min.


How do i maintain a shared state  ( total amount , total min , total data
etc ) so that i know how much i accumulated at any given point as events for
same phone can go to any node / executor. 


Can some one please tell me how can i achieve this is spark as in storm i
can have a bolt which can do this ?






On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <> wrote:

I guess both. In terms of syntax, I was comparing it with Trident.


If you are joining, Spark Streaming actually does offer windowed join out of
the box. We couldn't use this though as our event stream can grow
"out-of-sync", so we had to implement something on top of Storm. If your
event streams don't become out of sync, you may find the built-in join in
Spark Streaming useful. Storm also has a join keyword but its semantics are



> Also, what do you mean by "No Back Pressure" ?


So when a topology is overloaded, Storm is designed so that it will stop
reading from the source. Spark on the other hand, will keep reading from the
source and spilling it internally. This maybe fine, in fairness, but it does
mean you have to worry about the persistent store usage in the processing
cluster, whereas with Storm you don't have to worry because the messages
just remain in the data store.


Spark came up with the idea of rate limiting, but I don't feel this is as
nice as back pressure because it's very difficult to tune it such that you
don't cap the cluster's processing power but yet so that it will prevent the
persistent storage to get used up.



On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast <>

When you say Storm, did you mean Storm with Trident or Storm?


My use case does not have simple transformation. There are complex events
that need to be generated by joining the incoming event stream.


Also, what do you mean by "No Back PRessure" ?





On Wednesday, 17 June 2015 11:57 AM, Enno Shioji <> wrote:


We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.


Some of the important draw backs are:

Spark has no back pressure (receiver rate limit can alleviate this to a
certain point, but it's far from ideal)

There is also no exactly-once semantics. (updateStateByKey can achieve this
semantics, but is not practical if you have any significant amount of state
because it does so by dumping the entire state on every checkpointing)


There are also some minor drawbacks that I'm sure will be fixed quickly,
like no task timeout, not being able to read from Kafka using multiple
nodes, data loss hazard with Kafka.


It's also not possible to attain very low latency in Spark, if that's what
you need.


The pos for Spark is the concise and IMO more intuitive syntax, especially
if you compare it with Storm's Java API.


I admit I might be a bit biased towards Storm tho as I'm more familiar with


Also, you can do some processing with Kinesis. If all you need to do is
straight forward transformation and you are reading from Kinesis to begin
with, it might be an easier option to just do the transformation in Kinesis.






On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan
<> wrote:

Whatever you write in bolts would be the logic you want to apply on your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic like Storm does.


Probably overloading the question a bit.

In Storm, Bolts have the functionality of getting triggered on events. Is
that kind of functionality possible with Spark streaming? During each phase
of the data processing, the transformed data is stored to the database and
this transformed data should then be sent to a new pipeline for further

How can this be achieved using Spark?


On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast
<> wrote:

I have a use-case where a stream of Incoming events have to be aggregated
and joined to create Complex events. The aggregation will have to happen at
an interval of 1 minute (or less).


The pipeline is :

                                  send events
enrich event

Upstream services -------------------> KAFKA ---------> event Stream
Processor ------------> Complex Event Processor ------------> Elastic


>From what I understand, Storm will make a very good ESP and Spark Streaming
will make a good CEP.


But, we are also evaluating Storm with Trident.


How does Spark Streaming compare with Storm with Trident?


Sridhar Chellappa







On Wednesday, 17 June 2015 10:02 AM, ayan guha <> wrote:


I have a similar scenario where we need to bring data from kinesis to hbase.
Data volecity is 20k per 10 mins. Little manipulation of data will be
required but that's regardless of the tool so we will be writing that piece
in Java pojo. 

All env is on aws. Hbase is on a long running EMR and kinesis on a separate


On 17 Jun 2015 12:13, "Will Briggs" <> wrote:

The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my
old experience with it, I would equate Spark Streaming more with Storm's
Trident API, rather than with the raw Bolt API. Even then, there are
significant differences, but it's a bit closer.

If you can share your use case, we might be able to provide better guidance.


On June 16, 2015, at 9:46 PM, wrote:

Hi All,

I am evaluating spark VS storm ( spark streaming  ) and i am not able to see
what is equivalent of Bolt in storm inside spark.

Any help will be appreciated on this ?

Thanks ,
To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:








Best Regards,
Ayan Guha






View raw message