spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Enno Shioji <eshi...@gmail.com>
Subject Re: Spark or Storm
Date Wed, 17 Jun 2015 18:49:48 GMT
Hi Matei,


Ah, can't get more accurate than from the horse's mouth... If you don't
mind helping me understand it correctly..

>From what I understand, Storm Trident does the following (when used with
Kafka):
1) Sit on Kafka Spout and create batches
2) Assign global sequential ID to the batches
3) Make sure that all result of processed batches are written once to
TridentState, *in order* (for example, by skipping batches that were
already applied once, ultimately by using Zookeeper)

TridentState is an interface that you have to implement, and the underlying
storage has to be transactional for this to work. The necessary skipping
etc. is handled by Storm.

In case of Spark Streaming, I understand that
1) There is no global ordering; e.g. an output operation for batch
consisting of offset [4,5,6] can be invoked before the operation for offset
[1,2,3]
2) If you wanted to achieve something similar to what TridentState does,
you'll have to do it yourself (for example using Zookeeper)

Is this a correct understanding?




On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia <matei.zaharia@gmail.com>
wrote:

> This documentation is only for writes to an external system, but all the
> counting you do within your streaming app (e.g. if you use
> reduceByKeyAndWindow to keep track of a running count) is exactly-once.
> When you write to a storage system, no matter which streaming framework you
> use, you'll have to make sure the writes are idempotent, because the
> storage system can't know whether you meant to write the same data again or
> not. But the place where Spark Streaming helps over Storm, etc is for
> tracking state within your computation. Without that facility, you'd not
> only have to make sure that writes are idempotent, but you'd have to make
> sure that updates to your own internal state (e.g. reduceByKeyAndWindow)
> are exactly-once too.
>
> Matei
>
>
> On Jun 17, 2015, at 8:26 AM, Enno Shioji <eshioji@gmail.com> wrote:
>
> The thing is, even with that improvement, you still have to make updates
> idempotent or transactional yourself. If you read
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
>
> that refers to the latest version, it says:
>
> Semantics of output operations
>
> Output operations (like foreachRDD) have *at-least once* semantics, that
> is, the transformed data may get written to an external entity more than
> once in the event of a worker failure. While this is acceptable for saving
> to file systems using the saveAs***Files operations (as the file will
> simply get overwritten with the same data), additional effort may be
> necessary to achieve exactly-once semantics. There are two approaches.
>
>    -
>
>    *Idempotent updates*: Multiple attempts always write the same data.
>    For example, saveAs***Files always writes the same data to the
>    generated files.
>    -
>
>    *Transactional updates*: All updates are made transactionally so that
>    updates are made exactly once atomically. One way to do this would be the
>    following.
>    - Use the batch time (available in foreachRDD) and the partition index
>       of the transformed RDD to create an identifier. This identifier uniquely
>       identifies a blob data in the streaming application.
>       - Update external system with this blob transactionally (that is,
>       exactly once, atomically) using the identifier. That is, if the identifier
>       is not already committed, commit the partition data and the identifier
>       atomically. Else if this was already committed, skip the update.
>
>
> So either you make the update idempotent, or you have to make it
> transactional yourself, and the suggested mechanism is very similar to what
> Storm does.
>
>
>
>
> On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni <asoni.learn@gmail.com>
> wrote:
>
>> @Enno
>> As per the latest version and documentation Spark Streaming does offer
>> exactly once semantics using improved kafka integration , Not i have not
>> tested yet.
>>
>> Any feedback will be helpful if anyone is tried the same.
>>
>> http://koeninger.github.io/kafka-exactly-once/#7
>>
>>
>> https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
>>
>>
>>
>> On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji <eshioji@gmail.com> wrote:
>>
>>> AFAIK KCL is *supposed* to provide fault tolerance and load balancing
>>> (plus additionally, elastic scaling unlike Storm), Kinesis providing the
>>> coordination. My understanding is that it's like a naked Storm worker
>>> process that can consequently only do map.
>>>
>>> I haven't really used it tho, so can't really comment how it compares to
>>> Spark/Storm. Maybe somebody else will be able to comment.
>>>
>>>
>>>
>>> On Wed, Jun 17, 2015 at 3:13 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>>> Thanks for this. It's kcl based kinesis application. But because its
>>>> just a Java application we are thinking to use spark on EMR or storm for
>>>> fault tolerance and load balancing. Is it a correct approach?
>>>> On 17 Jun 2015 23:07, "Enno Shioji" <eshioji@gmail.com> wrote:
>>>>
>>>>> Hi Ayan,
>>>>>
>>>>> Admittedly I haven't done much with Kinesis, but if I'm not mistaken
>>>>> you should be able to use their "processor" interface for that. In this
>>>>> example, it's incrementing a counter:
>>>>> https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java
>>>>>
>>>>> Instead of incrementing a counter, you could do your transformation
>>>>> and send it to HBase.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 17, 2015 at 1:40 PM, ayan guha <guha.ayan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Great discussion!!
>>>>>>
>>>>>> One qs about some comment: Also, you can do some processing with
>>>>>> Kinesis. If all you need to do is straight forward transformation
and you
>>>>>> are reading from Kinesis to begin with, it might be an easier option
to
>>>>>> just do the transformation in Kinesis
>>>>>>
>>>>>> - Do you mean KCL application? Or some kind of processing
>>>>>> withinKineis?
>>>>>>
>>>>>> Can you kindly share a link? I would definitely pursue this route
as
>>>>>> our transformations are really simple.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni <asoni.learn@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> My Use case is below
>>>>>>>
>>>>>>> We are going to receive lot of event as stream ( basically Kafka
>>>>>>> Stream ) and then we need to process and compute
>>>>>>>
>>>>>>> Consider you have a phone contract with ATT and every call /
sms /
>>>>>>> data useage you do is an event and then it needs  to calculate
your bill on
>>>>>>> real time basis so when you login to your account you can see
all those
>>>>>>> variable as how much you used and how much is left and what is
your bill
>>>>>>> till date ,Also there are different rules which need to be considered
when
>>>>>>> you calculate the total bill one simple rule will be 0-500 min
it is free
>>>>>>> but above it is $1 a min.
>>>>>>>
>>>>>>> How do i maintain a shared state  ( total amount , total min
, total
>>>>>>> data etc ) so that i know how much i accumulated at any given
point as
>>>>>>> events for same phone can go to any node / executor.
>>>>>>>
>>>>>>> Can some one please tell me how can i achieve this is spark as
in
>>>>>>> storm i can have a bolt which can do this ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <eshioji@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I guess both. In terms of syntax, I was comparing it with
Trident.
>>>>>>>>
>>>>>>>> If you are joining, Spark Streaming actually does offer windowed
>>>>>>>> join out of the box. We couldn't use this though as our event
stream can
>>>>>>>> grow "out-of-sync", so we had to implement something on top
of Storm. If
>>>>>>>> your event streams don't become out of sync, you may find
the built-in join
>>>>>>>> in Spark Streaming useful. Storm also has a join keyword
but its semantics
>>>>>>>> are different.
>>>>>>>>
>>>>>>>>
>>>>>>>> > Also, what do you mean by "No Back Pressure" ?
>>>>>>>>
>>>>>>>> So when a topology is overloaded, Storm is designed so that
it will
>>>>>>>> stop reading from the source. Spark on the other hand, will
keep reading
>>>>>>>> from the source and spilling it internally. This maybe fine,
in fairness,
>>>>>>>> but it does mean you have to worry about the persistent store
usage in the
>>>>>>>> processing cluster, whereas with Storm you don't have to
worry because the
>>>>>>>> messages just remain in the data store.
>>>>>>>>
>>>>>>>> Spark came up with the idea of rate limiting, but I don't
feel this
>>>>>>>> is as nice as back pressure because it's very difficult to
tune it such
>>>>>>>> that you don't cap the cluster's processing power but yet
so that it will
>>>>>>>> prevent the persistent storage to get used up.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast <
>>>>>>>> sparkenthusiast@yahoo.in> wrote:
>>>>>>>>
>>>>>>>>> When you say Storm, did you mean Storm with Trident or
Storm?
>>>>>>>>>
>>>>>>>>> My use case does not have simple transformation. There
are complex
>>>>>>>>> events that need to be generated by joining the incoming
event stream.
>>>>>>>>>
>>>>>>>>> Also, what do you mean by "No Back PRessure" ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji <
>>>>>>>>> eshioji@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We've evaluated Spark Streaming vs. Storm and ended up
sticking
>>>>>>>>> with Storm.
>>>>>>>>>
>>>>>>>>> Some of the important draw backs are:
>>>>>>>>> Spark has no back pressure (receiver rate limit can alleviate
this
>>>>>>>>> to a certain point, but it's far from ideal)
>>>>>>>>> There is also no exactly-once semantics. (updateStateByKey
can
>>>>>>>>> achieve this semantics, but is not practical if you have
any significant
>>>>>>>>> amount of state because it does so by dumping the entire
state on every
>>>>>>>>> checkpointing)
>>>>>>>>>
>>>>>>>>> There are also some minor drawbacks that I'm sure will
be fixed
>>>>>>>>> quickly, like no task timeout, not being able to read
from Kafka using
>>>>>>>>> multiple nodes, data loss hazard with Kafka.
>>>>>>>>>
>>>>>>>>> It's also not possible to attain very low latency in
Spark, if
>>>>>>>>> that's what you need.
>>>>>>>>>
>>>>>>>>> The pos for Spark is the concise and IMO more intuitive
syntax,
>>>>>>>>> especially if you compare it with Storm's Java API.
>>>>>>>>>
>>>>>>>>> I admit I might be a bit biased towards Storm tho as
I'm more
>>>>>>>>> familiar with it.
>>>>>>>>>
>>>>>>>>> Also, you can do some processing with Kinesis. If all
you need to
>>>>>>>>> do is straight forward transformation and you are reading
from Kinesis to
>>>>>>>>> begin with, it might be an easier option to just do the
transformation in
>>>>>>>>> Kinesis.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan
<
>>>>>>>>> sabarish.sasidharan@manthan.com> wrote:
>>>>>>>>>
>>>>>>>>> Whatever you write in bolts would be the logic you want
to apply
>>>>>>>>> on your events. In Spark, that logic would be coded in
map() or similar
>>>>>>>>> such  transformations and/or actions. Spark doesn't enforce
a structure for
>>>>>>>>> capturing your processing logic like Storm does.
>>>>>>>>> Regards
>>>>>>>>> Sab
>>>>>>>>> Probably overloading the question a bit.
>>>>>>>>>
>>>>>>>>> In Storm, Bolts have the functionality of getting triggered
on
>>>>>>>>> events. Is that kind of functionality possible with Spark
streaming? During
>>>>>>>>> each phase of the data processing, the transformed data
is stored to the
>>>>>>>>> database and this transformed data should then be sent
to a new pipeline
>>>>>>>>> for further processing
>>>>>>>>>
>>>>>>>>> How can this be achieved using Spark?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast <
>>>>>>>>> sparkenthusiast@yahoo.in> wrote:
>>>>>>>>>
>>>>>>>>> I have a use-case where a stream of Incoming events have
to be
>>>>>>>>> aggregated and joined to create Complex events. The aggregation
will have
>>>>>>>>> to happen at an interval of 1 minute (or less).
>>>>>>>>>
>>>>>>>>> The pipeline is :
>>>>>>>>>                                   send events
>>>>>>>>>                      enrich event
>>>>>>>>> Upstream services -------------------> KAFKA --------->
event
>>>>>>>>> Stream Processor ------------> Complex Event Processor
------------>
>>>>>>>>> Elastic Search.
>>>>>>>>>
>>>>>>>>> From what I understand, Storm will make a very good ESP
and Spark
>>>>>>>>> Streaming will make a good CEP.
>>>>>>>>>
>>>>>>>>> But, we are also evaluating Storm with Trident.
>>>>>>>>>
>>>>>>>>> How does Spark Streaming compare with Storm with Trident?
>>>>>>>>>
>>>>>>>>> Sridhar Chellappa
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   On Wednesday, 17 June 2015 10:02 AM, ayan guha <
>>>>>>>>> guha.ayan@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have a similar scenario where we need to bring data
from kinesis
>>>>>>>>> to hbase. Data volecity is 20k per 10 mins. Little manipulation
of data
>>>>>>>>> will be required but that's regardless of the tool so
we will be writing
>>>>>>>>> that piece in Java pojo.
>>>>>>>>> All env is on aws. Hbase is on a long running EMR and
kinesis on a
>>>>>>>>> separate cluster.
>>>>>>>>> TIA.
>>>>>>>>> Best
>>>>>>>>> Ayan
>>>>>>>>> On 17 Jun 2015 12:13, "Will Briggs" <wrbriggs@gmail.com>
wrote:
>>>>>>>>>
>>>>>>>>> The programming models for the two frameworks are conceptually
>>>>>>>>> rather different; I haven't worked with Storm for quite
some time, but
>>>>>>>>> based on my old experience with it, I would equate Spark
Streaming more
>>>>>>>>> with Storm's Trident API, rather than with the raw Bolt
API. Even then,
>>>>>>>>> there are significant differences, but it's a bit closer.
>>>>>>>>>
>>>>>>>>> If you can share your use case, we might be able to provide
better
>>>>>>>>> guidance.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Will
>>>>>>>>>
>>>>>>>>> On June 16, 2015, at 9:46 PM, asoni.learn@gmail.com wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I am evaluating spark VS storm ( spark streaming  ) and
i am not
>>>>>>>>> able to see what is equivalent of Bolt in storm inside
spark.
>>>>>>>>>
>>>>>>>>> Any help will be appreciated on this ?
>>>>>>>>>
>>>>>>>>> Thanks ,
>>>>>>>>> Ashish
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>

Mime
View raw message