spark-user mailing list archives

From Ashish Soni <asoni.le...@gmail.com>
Subject Re: Spark or Storm
Date Wed, 17 Jun 2015 14:51:00 GMT
@Enno
As per the latest version and documentation, Spark Streaming does offer
exactly-once semantics using the improved Kafka integration. Note that I have
not tested it yet.

Any feedback will be helpful if anyone has tried the same.

http://koeninger.github.io/kafka-exactly-once/#7

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
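
As far as I understand the first link, one way to get exactly-once results is
to store the output and the Kafka offsets in the same transaction, so that a
replayed batch is either skipped or applied as a whole. Below is a minimal
sketch of that idea in plain Python, with sqlite3 standing in for the real
store. It does not use Spark or Kafka at all; the table names and the
make_store / process_batch helpers are hypothetical and only there for
illustration.

```python
import sqlite3


def make_store():
    """In-memory store holding the running totals and the committed offsets."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE offsets (part INTEGER PRIMARY KEY, until_offset INTEGER NOT NULL)")
    return conn


def process_batch(conn, part, from_offset, records):
    """Apply one batch of (key, amount) records for a partition.

    The totals and the new offset are written in one transaction.
    Returns True if the batch was applied, False if it was skipped because
    it does not start at the committed offset (for example, a replay).
    """
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT until_offset FROM offsets WHERE part = ?", (part,)
        ).fetchone()
        committed = row[0] if row else 0
        if from_offset != committed:
            return False
        for key, amount in records:
            cur = conn.execute(
                "UPDATE totals SET total = total + ? WHERE key = ?",
                (amount, key))
            if cur.rowcount == 0:  # first time we see this key
                conn.execute(
                    "INSERT INTO totals (key, total) VALUES (?, ?)",
                    (key, amount))
        conn.execute(
            "INSERT OR REPLACE INTO offsets (part, until_offset) VALUES (?, ?)",
            (part, from_offset + len(records)))
    return True
```

Calling process_batch twice with the same from_offset applies the records only
once, because the second call sees that the committed offset has already
moved on. If a record fails halfway through, the whole batch is rolled back
together with its offset.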



On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji <eshioji@gmail.com> wrote:

> AFAIK KCL is *supposed* to provide fault tolerance and load balancing
> (plus additionally, elastic scaling unlike Storm), Kinesis providing the
> coordination. My understanding is that it's like a naked Storm worker
> process that can consequently only do map.
>
> I haven't really used it tho, so can't really comment how it compares to
> Spark/Storm. Maybe somebody else will be able to comment.
>
>
>
> On Wed, Jun 17, 2015 at 3:13 PM, ayan guha <guha.ayan@gmail.com> wrote:
>
>> Thanks for this. It's a KCL-based Kinesis application. But because it's
>> just a Java application, we are thinking of using Spark on EMR or Storm
>> for fault tolerance and load balancing. Is that a correct approach?
>> On 17 Jun 2015 23:07, "Enno Shioji" <eshioji@gmail.com> wrote:
>>
>>> Hi Ayan,
>>>
>>> Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
>>> should be able to use their "processor" interface for that. In this
>>> example, it's incrementing a counter:
>>> https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java
>>>
>>> Instead of incrementing a counter, you could do your transformation and
>>> send it to HBase.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 17, 2015 at 1:40 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>>> Great discussion!!
>>>>
>>>> One question about this comment: "Also, you can do some processing with
>>>> Kinesis. If all you need to do is straightforward transformation and you
>>>> are reading from Kinesis to begin with, it might be an easier option to
>>>> just do the transformation in Kinesis."
>>>>
>>>> - Do you mean a KCL application? Or some kind of processing within Kinesis?
>>>>
>>>> Can you kindly share a link? I would definitely pursue this route as
>>>> our transformations are really simple.
>>>>
>>>> Best
>>>>
>>>> On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni <asoni.learn@gmail.com>
>>>> wrote:
>>>>
>>>>> My use case is below.
>>>>>
>>>>> We are going to receive a lot of events as a stream (basically a Kafka
>>>>> stream) and then we need to process and compute.
>>>>>
>>>>> Consider you have a phone contract with ATT and every call / SMS /
>>>>> data usage is an event, and then it needs to calculate your bill on a
>>>>> real-time basis, so when you log in to your account you can see all
>>>>> those variables: how much you used, how much is left and what your
>>>>> bill is to date. Also there are different rules which need to be
>>>>> considered when you calculate the total bill; one simple rule would be
>>>>> that 0-500 min is free but above that it is $1 a min.
>>>>>
>>>>> How do I maintain a shared state (total amount, total min, total
>>>>> data etc.) so that I know how much I have accumulated at any given
>>>>> point, as events for the same phone can go to any node / executor?
>>>>>
>>>>> Can someone please tell me how I can achieve this in Spark, as in
>>>>> Storm I can have a bolt which can do this?
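
One way to picture this in Spark Streaming is updateStateByKey: Spark groups
the events of each batch by key (here the phone number) and calls an update
function with the new values and the previous state for that key. Below is a
minimal sketch of such an update function in plain Python, using the 0-500
minutes free / $1 a minute rule from the question. It runs without Spark; the
update_usage and run_batches names and the state layout are hypothetical, and
run_batches is only a stand-in for what the framework would do.

```python
FREE_MINUTES = 500      # rule from the question: the first 500 minutes are free
RATE_PER_MINUTE = 1.0   # above that it is $1 a minute


def update_usage(new_minutes, state):
    """Fold the minutes seen in this batch into the running state for one phone.

    new_minutes: list of minutes from the events in the current batch
    state: previous state for this phone, or None if there is none yet
    """
    total = (state["minutes"] if state else 0) + sum(new_minutes)
    bill = max(0, total - FREE_MINUTES) * RATE_PER_MINUTE
    return {"minutes": total, "bill": bill}


def run_batches(batches):
    """Stand-in driver: group each batch by phone and update per-phone state."""
    state = {}
    for batch in batches:
        grouped = {}
        for phone, minutes in batch:
            grouped.setdefault(phone, []).append(minutes)
        for phone, values in grouped.items():
            state[phone] = update_usage(values, state.get(phone))
    return state
```

In this sketch only phones with new events are updated in a batch. In PySpark,
a function with the same (new_values, last_state) shape could be passed to
updateStateByKey on a stream of (phone, minutes) pairs, which requires
checkpointing to be enabled.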
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <eshioji@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I guess both. In terms of syntax, I was comparing it with Trident.
>>>>>>
>>>>>> If you are joining, Spark Streaming actually does offer windowed join
>>>>>> out of the box. We couldn't use this though as our event stream can
>>>>>> grow "out-of-sync", so we had to implement something on top of Storm.
>>>>>> If your event streams don't become out of sync, you may find the
>>>>>> built-in join in Spark Streaming useful. Storm also has a join keyword
>>>>>> but its semantics are different.
>>>>>>
>>>>>>
>>>>>> > Also, what do you mean by "No Back Pressure" ?
>>>>>>
>>>>>> So when a topology is overloaded, Storm is designed so that it will
>>>>>> stop reading from the source. Spark, on the other hand, will keep
>>>>>> reading from the source and spilling it internally. This may be fine,
>>>>>> in fairness, but it does mean you have to worry about the persistent
>>>>>> store usage in the processing cluster, whereas with Storm you don't
>>>>>> have to worry because the messages just remain in the data store.
>>>>>>
>>>>>> Spark came up with the idea of rate limiting, but I don't feel this
>>>>>> is as nice as back pressure because it's very difficult to tune it
>>>>>> such that you don't cap the cluster's processing power and yet
>>>>>> prevent the persistent storage from getting used up.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast <
>>>>>> sparkenthusiast@yahoo.in> wrote:
>>>>>>
>>>>>>> When you say Storm, did you mean Storm with Trident or Storm?
>>>>>>>
>>>>>>> My use case does not have simple transformations. There are complex
>>>>>>> events that need to be generated by joining the incoming event
>>>>>>> stream.
>>>>>>>
>>>>>>> Also, what do you mean by "No Back Pressure"?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji <
>>>>>>> eshioji@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> We've evaluated Spark Streaming vs. Storm and ended up sticking
>>>>>>> with Storm.
>>>>>>>
>>>>>>> Some of the important drawbacks are:
>>>>>>> - Spark has no back pressure (the receiver rate limit can alleviate
>>>>>>>   this to a certain point, but it's far from ideal).
>>>>>>> - There are also no exactly-once semantics (updateStateByKey can
>>>>>>>   achieve these semantics, but is not practical if you have any
>>>>>>>   significant amount of state because it does so by dumping the
>>>>>>>   entire state on every checkpoint).
>>>>>>>
>>>>>>> There are also some minor drawbacks that I'm sure will be fixed
>>>>>>> quickly, like no task timeout, not being able to read from Kafka
>>>>>>> using multiple nodes, and a data loss hazard with Kafka.
>>>>>>>
>>>>>>> It's also not possible to attain very low latency in Spark, if
>>>>>>> that's what you need.
>>>>>>>
>>>>>>> The plus for Spark is the concise and IMO more intuitive syntax,
>>>>>>> especially if you compare it with Storm's Java API.
>>>>>>>
>>>>>>> I admit I might be a bit biased towards Storm tho as I'm more
>>>>>>> familiar with it.
>>>>>>>
>>>>>>> Also, you can do some processing with Kinesis. If all you need to
>>>>>>> do is straightforward transformation and you are reading from
>>>>>>> Kinesis to begin with, it might be an easier option to just do the
>>>>>>> transformation in Kinesis.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan <
>>>>>>> sabarish.sasidharan@manthan.com> wrote:
>>>>>>>
>>>>>>> Whatever you write in bolts would be the logic you want to apply on
>>>>>>> your events. In Spark, that logic would be coded in map() or similar
>>>>>>> such transformations and/or actions. Spark doesn't enforce a
>>>>>>> structure for capturing your processing logic like Storm does.
>>>>>>> Regards
>>>>>>> Sab
>>>>>>> Probably overloading the question a bit.
>>>>>>>
>>>>>>> In Storm, Bolts have the functionality of getting triggered on
>>>>>>> events. Is that kind of functionality possible with Spark Streaming?
>>>>>>> During each phase of the data processing, the transformed data is
>>>>>>> stored to the database and this transformed data should then be sent
>>>>>>> to a new pipeline for further processing.
>>>>>>>
>>>>>>> How can this be achieved using Spark?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast <
>>>>>>> sparkenthusiast@yahoo.in> wrote:
>>>>>>>
>>>>>>> I have a use case where a stream of incoming events has to be
>>>>>>> aggregated and joined to create complex events. The aggregation
>>>>>>> will have to happen at an interval of 1 minute (or less).
>>>>>>>
>>>>>>> The pipeline is:
>>>>>>>
>>>>>>>                     send events
>>>>>>> Upstream services -------------> KAFKA ---> Event Stream Processor
>>>>>>>
>>>>>>>      enrich event
>>>>>>>   ----------------> Complex Event Processor ---> Elastic Search
>>>>>>>
>>>>>>> From what I understand, Storm will make a very good ESP and Spark
>>>>>>> Streaming will make a good CEP.
>>>>>>>
>>>>>>> But, we are also evaluating Storm with Trident.
>>>>>>>
>>>>>>> How does Spark Streaming compare with Storm with Trident?
>>>>>>>
>>>>>>> Sridhar Chellappa
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   On Wednesday, 17 June 2015 10:02 AM, ayan guha <
>>>>>>> guha.ayan@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I have a similar scenario where we need to bring data from Kinesis
>>>>>>> to HBase. Data velocity is 20k per 10 mins. A little manipulation of
>>>>>>> the data will be required, but that's regardless of the tool, so we
>>>>>>> will be writing that piece as a Java POJO.
>>>>>>> All env is on AWS. HBase is on a long-running EMR and Kinesis on a
>>>>>>> separate cluster.
>>>>>>> TIA.
>>>>>>> Best
>>>>>>> Ayan
>>>>>>> On 17 Jun 2015 12:13, "Will Briggs" <wrbriggs@gmail.com> wrote:
>>>>>>>
>>>>>>> The programming models for the two frameworks are conceptually
>>>>>>> rather different; I haven't worked with Storm for quite some time,
>>>>>>> but based on my old experience with it, I would equate Spark
>>>>>>> Streaming more with Storm's Trident API, rather than with the raw
>>>>>>> Bolt API. Even then, there are significant differences, but it's a
>>>>>>> bit closer.
>>>>>>>
>>>>>>> If you can share your use case, we might be able to provide better
>>>>>>> guidance.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Will
>>>>>>>
>>>>>>> On June 16, 2015, at 9:46 PM, asoni.learn@gmail.com wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am evaluating Spark vs. Storm (Spark Streaming) and I am not able
>>>>>>> to see what the equivalent of a Storm Bolt is in Spark.
>>>>>>>
>>>>>>> Any help on this will be appreciated.
>>>>>>>
>>>>>>> Thanks ,
>>>>>>> Ashish
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>
