From: Tathagata Das
Date: Wed, 17 Jun 2015 14:21:30 -0700
Subject: Re: Spark or Storm
To: Matei Zaharia
Cc: Enno Shioji, Ashish Soni, ayan guha, Sabarish Sasidharan, Spark Enthusiast, Will Briggs, user@spark.apache.org, Sateesh Kavuri

To add more information beyond what Matei said and to answer the original question, here are other things to consider when comparing Spark Streaming and Storm.

* Unified programming model and semantics - On most occasions you have to process the same data again in batch jobs. If you have two separate systems for batch and streaming, it is much harder to share the code. You will have to deal with two different processing models, each with its own semantics. Compare Storm's "join" with a usual batch join: Spark and Spark Streaming share the same join semantics because they are based on the same RDD model underneath. (A sketch of this code sharing follows this message.)

* Integration with the Spark ecosystem - Many people really want to go beyond basic streaming ETL and into advanced streaming analytics:
  - Combine stream processing with static datasets
  - Apply dynamically updated machine learning models (i.e. offline learning and online prediction, or even continuous learning and prediction)
  - Apply DataFrame and SQL operations on streams
  These things are pretty easy to do with the Spark ecosystem.

* Operational management - You have to weigh the operational cost of managing two separate systems for batch and stream processing (each with its own deployment model) against managing one single engine with one deployment model.

* Performance - According to Intel's independent study, Spark Streaming in Kafka direct mode can have 2.5-3x the throughput of Trident at a 0.5 GB batch size, and at that batch size the latency of Trident is 30 seconds, compared to 1.5 seconds for Spark Streaming. This is from a talk that Intel gave at Spark Summit (https://spark-summit.org/2015/) two days ago. Slides will be available soon, but here is a sneak peek - throughput: http://i.imgur.com/u6pf4rB.png and latency: http://imgur.com/c46MJ4i

I will post the link to the slides when they come out, hopefully next week.
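A minimal sketch of that code sharing, assuming a word-count pipeline (the socket source, paths, and object names are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SharedLogic {
      // One function, written once, used by both the batch job and the stream.
      def wordCounts(lines: RDD[String]): RDD[(String, Long)] =
        lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    }

    object UnifiedModel {
      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("unified"))
        val ssc = new StreamingContext(sc, Seconds(10))

        // Batch: the logic applied to a static dataset.
        SharedLogic.wordCounts(sc.textFile("hdfs:///data/logs"))
          .saveAsTextFile("hdfs:///out/batch-counts")

        // Streaming: the same logic applied to every micro-batch via transform().
        ssc.socketTextStream("localhost", 9999)
          .transform(SharedLogic.wordCounts _)
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because a DStream is just a sequence of RDDs, transform() lets the identical RDD-level function serve both modes; there is no second API or second set of join semantics to learn.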
On Wed, Jun 17, 2015 at 11:55 AM, Matei Zaharia wrote:

> The major difference is that in Spark Streaming, there's no *need* for a
> TridentState for state inside your computation. All the stateful operations
> (reduceByWindow, updateStateByKey, etc.) automatically handle exactly-once
> processing, keeping updates in order, etc. Also, you don't need to run a
> separate transactional system (e.g. MySQL) to store the state.
>
> After your computation runs, if you want to write the final results (e.g.
> the counts you've been tracking) to a storage system, you use one of the
> output operations (saveAsFiles, foreach, etc.). Those actually will run in
> order, but some might run multiple times if nodes fail, thus attempting to
> write the same state again.
>
> You can read about how it works in this research paper:
> http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf
>
> Matei
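A minimal sketch of the stateful operations Matei mentions, using reduceByKeyAndWindow for an exactly-once running count inside the computation (the window lengths, paths, and socket source are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RunningCounts {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("counts"), Seconds(5))
        ssc.checkpoint("hdfs:///checkpoints/counts") // required for inverse-windowed state

        // Running count per word over the last 60 seconds, updated every 5 seconds.
        // The state lives inside the computation; no external store is involved.
        ssc.socketTextStream("localhost", 9999)
          .map(word => (word, 1L))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(5))
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

The inverse function (_ - _) lets Spark subtract the batch sliding out of the window instead of recomputing the whole window, which is why checkpointing is required here.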
> On Jun 17, 2015, at 11:49 AM, Enno Shioji wrote:
>
> Hi Matei,
>
> Ah, can't get more accurate than from the horse's mouth... If you don't
> mind helping me understand it correctly...
>
> From what I understand, Storm Trident does the following (when used with
> Kafka):
> 1) Sit on the Kafka Spout and create batches
> 2) Assign a global sequential ID to the batches
> 3) Make sure that all results of processed batches are written once to
> TridentState, *in order* (for example, by skipping batches that were
> already applied, ultimately by using Zookeeper)
>
> TridentState is an interface that you have to implement, and the
> underlying storage has to be transactional for this to work. The necessary
> skipping etc. is handled by Storm.
>
> In the case of Spark Streaming, I understand that
> 1) There is no global ordering; e.g. an output operation for the batch
> consisting of offsets [4,5,6] can be invoked before the operation for the
> batch consisting of offsets [1,2,3]
> 2) If you want to achieve something similar to what TridentState does,
> you'll have to do it yourself (for example using Zookeeper)
>
> Is this a correct understanding?
>
> On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia wrote:
>
>> This documentation is only for writes to an external system, but all the
>> counting you do within your streaming app (e.g. if you use
>> reduceByKeyAndWindow to keep track of a running count) is exactly-once.
>> When you write to a storage system, no matter which streaming framework
>> you use, you'll have to make sure the writes are idempotent, because the
>> storage system can't know whether you meant to write the same data again
>> or not. But the place where Spark Streaming helps over Storm, etc. is in
>> tracking state within your computation. Without that facility, you'd not
>> only have to make sure that writes are idempotent, but you'd also have to
>> make sure that updates to your own internal state (e.g.
>> reduceByKeyAndWindow) are exactly-once.
>>
>> Matei
>>
>> On Jun 17, 2015, at 8:26 AM, Enno Shioji wrote:
>>
>> The thing is, even with that improvement, you still have to make updates
>> idempotent or transactional yourself. If you read
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
>> which refers to the latest version, it says:
>>
>> Semantics of output operations
>>
>> Output operations (like foreachRDD) have *at-least once* semantics, that
>> is, the transformed data may get written to an external entity more than
>> once in the event of a worker failure. While this is acceptable for saving
>> to file systems using the saveAs***Files operations (as the file will
>> simply get overwritten with the same data), additional effort may be
>> necessary to achieve exactly-once semantics. There are two approaches.
>>
>> - *Idempotent updates*: Multiple attempts always write the same data.
>>   For example, saveAs***Files always writes the same data to the
>>   generated files.
>>
>> - *Transactional updates*: All updates are made transactionally so that
>>   updates are made exactly once, atomically. One way to do this would be
>>   the following.
>>   - Use the batch time (available in foreachRDD) and the partition
>>     index of the transformed RDD to create an identifier. This identifier
>>     uniquely identifies a blob of data in the streaming application.
>>   - Update the external system with this blob transactionally (that is,
>>     exactly once, atomically) using the identifier. That is, if the
>>     identifier is not already committed, commit the partition data and the
>>     identifier atomically. Else, if it was already committed, skip the
>>     update.
>>
>> So either you make the update idempotent, or you have to make it
>> transactional yourself, and the suggested mechanism is very similar to
>> what Storm does.
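For concreteness, a sketch of the "transactional updates" recipe quoted above, assuming a JDBC store where the identifier table has a unique key (the table names, connection URL, and helper object are illustrative, not from the thread):

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.dstream.DStream

    object ExactlyOnceSink {
      // Hypothetical tables: committed_batches(id TEXT PRIMARY KEY); counts(word, n).
      def saveExactlyOnce(counts: DStream[(String, Long)], jdbcUrl: String): Unit = {
        counts.foreachRDD { (rdd, batchTime) =>
          rdd.foreachPartition { iter =>
            // Identifier unique per (batch, partition), stable across retries.
            val id = s"${batchTime.milliseconds}-${TaskContext.get.partitionId}"
            val conn = DriverManager.getConnection(jdbcUrl)
            try {
              conn.setAutoCommit(false)
              val mark = conn.prepareStatement(
                "INSERT INTO committed_batches (id) VALUES (?)")
              mark.setString(1, id)
              mark.executeUpdate() // fails on retry: id already committed
              val ins = conn.prepareStatement(
                "INSERT INTO counts (word, n) VALUES (?, ?)")
              iter.foreach { case (word, n) =>
                ins.setString(1, word); ins.setLong(2, n); ins.executeUpdate()
              }
              conn.commit() // data and identifier commit atomically
            } catch {
              // Assumed to be the unique-key violation from a retried partition;
              // real code should inspect the SQLState before skipping.
              case _: java.sql.SQLException => conn.rollback()
            } finally {
              conn.close()
            }
          }
        }
      }
    }

The identifier plays the same role as Trident's batch ID: a retried partition re-creates the same id, the unique constraint rejects it, and the duplicate write is skipped.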
>> On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni wrote:
>>
>>> @Enno
>>> As per the latest version and documentation, Spark Streaming does offer
>>> exactly-once semantics using the improved Kafka integration. Note that I
>>> have not tested it yet.
>>>
>>> Any feedback will be helpful if anyone has tried the same.
>>>
>>> http://koeninger.github.io/kafka-exactly-once/#7
>>>
>>> https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
>>>
>>> On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji wrote:
>>>
>>>> AFAIK KCL is *supposed* to provide fault tolerance and load balancing
>>>> (plus, additionally, elastic scaling, unlike Storm), with Kinesis
>>>> providing the coordination. My understanding is that it's like a naked
>>>> Storm worker process that can consequently only do map.
>>>>
>>>> I haven't really used it though, so I can't really comment on how it
>>>> compares to Spark/Storm. Maybe somebody else will be able to comment.
>>>>
>>>> On Wed, Jun 17, 2015 at 3:13 PM, ayan guha wrote:
>>>>
>>>>> Thanks for this. It's a KCL-based Kinesis application. But because it's
>>>>> just a Java application, we are thinking of using Spark on EMR or Storm
>>>>> for fault tolerance and load balancing. Is that a correct approach?
>>>>>
>>>>> On 17 Jun 2015 23:07, "Enno Shioji" wrote:
>>>>>
>>>>>> Hi Ayan,
>>>>>>
>>>>>> Admittedly I haven't done much with Kinesis, but if I'm not mistaken
>>>>>> you should be able to use their "processor" interface for that. In this
>>>>>> example, it's incrementing a counter:
>>>>>> https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java
>>>>>>
>>>>>> Instead of incrementing a counter, you could do your transformation
>>>>>> and send it to HBase.
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 1:40 PM, ayan guha wrote:
>>>>>>
>>>>>>> Great discussion!!
>>>>>>>
>>>>>>> One question about a comment: "Also, you can do some processing with
>>>>>>> Kinesis. If all you need to do is a straightforward transformation and
>>>>>>> you are reading from Kinesis to begin with, it might be an easier
>>>>>>> option to just do the transformation in Kinesis."
>>>>>>>
>>>>>>> - Do you mean a KCL application? Or some kind of processing within
>>>>>>> Kinesis?
>>>>>>>
>>>>>>> Can you kindly share a link? I would definitely pursue this route, as
>>>>>>> our transformations are really simple.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>> On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni wrote:
>>>>>>>
>>>>>>>> My use case is below.
>>>>>>>>
>>>>>>>> We are going to receive a lot of events as a stream (basically a
>>>>>>>> Kafka stream), and then we need to process and compute.
>>>>>>>>
>>>>>>>> Consider that you have a phone contract with AT&T, and every call /
>>>>>>>> SMS / data usage you make is an event. The system then needs to
>>>>>>>> calculate your bill on a real-time basis, so when you log in to your
>>>>>>>> account you can see all those variables: how much you used, how much
>>>>>>>> is left, and what your bill is to date. Also, there are different
>>>>>>>> rules which need to be considered when you calculate the total bill;
>>>>>>>> one simple rule would be that 0-500 minutes are free but above that
>>>>>>>> it is $1 a minute.
>>>>>>>>
>>>>>>>> How do I maintain shared state (total amount, total minutes, total
>>>>>>>> data, etc.) so that I know how much I have accumulated at any given
>>>>>>>> point, since events for the same phone can go to any node / executor?
>>>>>>>>
>>>>>>>> Can someone please tell me how I can achieve this in Spark, as in
>>>>>>>> Storm I can have a bolt which can do this?
>>>>>>>>
>>>>>>>> Thanks,
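One way to answer the shared-state question above, using updateStateByKey as Matei suggested earlier in the thread. The event format, Usage type, and billing helper are illustrative; the 0-500 free minutes rule comes from the question itself:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Usage(minutes: Long, sms: Long, dataMb: Long) {
      def +(o: Usage) = Usage(minutes + o.minutes, sms + o.sms, dataMb + o.dataMb)
      // Simple rule from the question: first 500 minutes free, $1/min after.
      def bill: Long = math.max(0L, minutes - 500L)
    }

    object PhoneBilling {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("billing"), Seconds(10))
        ssc.checkpoint("hdfs:///checkpoints/billing") // required by updateStateByKey

        // Assume "phoneId,minutes,sms,dataMb" lines; a Kafka source works the same way.
        val events = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(phone, min, sms, mb) = line.split(",")
          (phone, Usage(min.toLong, sms.toLong, mb.toLong))
        }

        // Spark partitions the state by key, so all events for one phone reach
        // the same state entry regardless of which executor received them.
        val totals = events.updateStateByKey[Usage] {
          (newEvents: Seq[Usage], state: Option[Usage]) =>
            Some(newEvents.foldLeft(state.getOrElse(Usage(0, 0, 0)))(_ + _))
        }

        totals.map { case (phone, u) => s"$phone: $u, bill so far = $$${u.bill}" }
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }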
>>>>>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji wrote:
>>>>>>>>
>>>>>>>>> I guess both. In terms of syntax, I was comparing it with Trident.
>>>>>>>>>
>>>>>>>>> If you are joining, Spark Streaming actually does offer a windowed
>>>>>>>>> join out of the box. We couldn't use this, though, as our event
>>>>>>>>> streams can grow "out of sync", so we had to implement something on
>>>>>>>>> top of Storm. If your event streams don't become out of sync, you
>>>>>>>>> may find the built-in join in Spark Streaming useful. Storm also
>>>>>>>>> has a join keyword, but its semantics are different.
>>>>>>>>>
>>>>>>>>> > Also, what do you mean by "No Back Pressure"?
>>>>>>>>>
>>>>>>>>> When a topology is overloaded, Storm is designed so that it will
>>>>>>>>> stop reading from the source. Spark, on the other hand, will keep
>>>>>>>>> reading from the source and spilling internally. This may be fine,
>>>>>>>>> in fairness, but it does mean you have to worry about the
>>>>>>>>> persistent store usage in the processing cluster, whereas with
>>>>>>>>> Storm you don't have to worry, because the messages just remain in
>>>>>>>>> the data store.
>>>>>>>>>
>>>>>>>>> Spark came up with the idea of rate limiting, but I don't feel this
>>>>>>>>> is as nice as back pressure, because it's very difficult to tune it
>>>>>>>>> so that you don't cap the cluster's processing power and yet still
>>>>>>>>> prevent the persistent storage from getting used up.
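A minimal sketch of that built-in windowed join, assuming two keyed event streams (the sources, ports, and the (id, value) record shape are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object WindowedJoin {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("join"), Seconds(5))

        def keyed(port: Int) =
          ssc.socketTextStream("localhost", port).map { line =>
            val Array(id, value) = line.split(",")
            (id, value)
          }

        // Join events from the two streams that fall within the same 1-minute
        // window. This is where "out of sync" streams hurt: matching events
        // that land in different windows never meet.
        keyed(9998).window(Minutes(1))
          .join(keyed(9999).window(Minutes(1)))
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }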
>>>>>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast wrote:
>>>>>>>>>
>>>>>>>>>> When you say Storm, did you mean Storm with Trident, or plain
>>>>>>>>>> Storm?
>>>>>>>>>>
>>>>>>>>>> My use case does not have simple transformations. There are
>>>>>>>>>> complex events that need to be generated by joining the incoming
>>>>>>>>>> event streams.
>>>>>>>>>>
>>>>>>>>>> Also, what do you mean by "No Back Pressure"?
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 17 June 2015 11:57 AM, Enno Shioji wrote:
>>>>>>>>>>
>>>>>>>>>> We've evaluated Spark Streaming vs. Storm and ended up sticking
>>>>>>>>>> with Storm.
>>>>>>>>>>
>>>>>>>>>> Some of the important drawbacks are:
>>>>>>>>>> - Spark has no back pressure (the receiver rate limit can
>>>>>>>>>>   alleviate this to a certain point, but it's far from ideal).
>>>>>>>>>> - There are also no exactly-once semantics. (updateStateByKey can
>>>>>>>>>>   achieve these semantics, but it is not practical if you have any
>>>>>>>>>>   significant amount of state, because it does so by dumping the
>>>>>>>>>>   entire state on every checkpoint.)
>>>>>>>>>>
>>>>>>>>>> There are also some minor drawbacks that I'm sure will be fixed
>>>>>>>>>> quickly, like no task timeout, not being able to read from Kafka
>>>>>>>>>> using multiple nodes, and a data-loss hazard with Kafka.
>>>>>>>>>>
>>>>>>>>>> It's also not possible to attain very low latency in Spark, if
>>>>>>>>>> that's what you need.
>>>>>>>>>>
>>>>>>>>>> The pro for Spark is the concise and, IMO, more intuitive syntax,
>>>>>>>>>> especially if you compare it with Storm's Java API.
>>>>>>>>>>
>>>>>>>>>> I admit I might be a bit biased towards Storm, though, as I'm more
>>>>>>>>>> familiar with it.
>>>>>>>>>>
>>>>>>>>>> Also, you can do some processing with Kinesis. If all you need to
>>>>>>>>>> do is a straightforward transformation and you are reading from
>>>>>>>>>> Kinesis to begin with, it might be an easier option to just do the
>>>>>>>>>> transformation in Kinesis.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan wrote:
>>>>>>>>>>
>>>>>>>>>> Whatever you write in bolts would be the logic you want to apply
>>>>>>>>>> to your events. In Spark, that logic would be coded in map() or
>>>>>>>>>> similar transformations and/or actions. Spark doesn't enforce a
>>>>>>>>>> structure for capturing your processing logic like Storm does.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sab
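A sketch of that mapping from bolts to Spark operations, with placeholder logic (the enrich function and the sink are illustrative stand-ins):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BoltEquivalent {
      def enrich(event: String): String = event.toUpperCase // stand-in for real logic

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("bolt-like"), Seconds(5))

        val enriched = ssc.socketTextStream("localhost", 9999)
          .filter(_.nonEmpty) // like a filtering bolt
          .map(enrich)        // like a transforming bolt

        // Like a terminal bolt writing downstream: runs once per micro-batch.
        enriched.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // open a connection to your store here, write, then close
            partition.foreach(println) // placeholder sink
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }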
>>>>>>>>>> Probably overloading the question a bit:
>>>>>>>>>>
>>>>>>>>>> In Storm, Bolts have the functionality of getting triggered on
>>>>>>>>>> events. Is that kind of functionality possible with Spark
>>>>>>>>>> Streaming? During each phase of the data processing, the
>>>>>>>>>> transformed data is stored in the database, and this transformed
>>>>>>>>>> data should then be sent to a new pipeline for further processing.
>>>>>>>>>>
>>>>>>>>>> How can this be achieved using Spark?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast wrote:
>>>>>>>>>>
>>>>>>>>>> I have a use case where a stream of incoming events has to be
>>>>>>>>>> aggregated and joined to create complex events. The aggregation
>>>>>>>>>> will have to happen at an interval of 1 minute (or less).
>>>>>>>>>>
>>>>>>>>>> The pipeline is:
>>>>>>>>>>
>>>>>>>>>>                    send events          enrich event
>>>>>>>>>> Upstream services ------------> KAFKA ------------> event Stream
>>>>>>>>>> Processor ------------> Complex Event Processor ------------>
>>>>>>>>>> Elastic Search.
>>>>>>>>>>
>>>>>>>>>> From what I understand, Storm will make a very good ESP and Spark
>>>>>>>>>> Streaming will make a good CEP.
>>>>>>>>>>
>>>>>>>>>> But we are also evaluating Storm with Trident.
>>>>>>>>>>
>>>>>>>>>> How does Spark Streaming compare with Storm with Trident?
>>>>>>>>>>
>>>>>>>>>> Sridhar Chellappa
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 17 June 2015 10:02 AM, ayan guha wrote:
>>>>>>>>>>
>>>>>>>>>> I have a similar scenario where we need to bring data from Kinesis
>>>>>>>>>> to HBase. Data velocity is 20k per 10 mins. A little manipulation
>>>>>>>>>> of the data will be required, but that's regardless of the tool,
>>>>>>>>>> so we will be writing that piece as a Java POJO.
>>>>>>>>>>
>>>>>>>>>> All env is on AWS. HBase is on a long-running EMR cluster and
>>>>>>>>>> Kinesis is on a separate cluster.
>>>>>>>>>>
>>>>>>>>>> TIA.
>>>>>>>>>> Best
>>>>>>>>>> Ayan
>>>>>>>>>>
>>>>>>>>>> On 17 Jun 2015 12:13, "Will Briggs" wrote:
>>>>>>>>>>
>>>>>>>>>> The programming models for the two frameworks are conceptually
>>>>>>>>>> rather different; I haven't worked with Storm for quite some time,
>>>>>>>>>> but based on my old experience with it, I would equate Spark
>>>>>>>>>> Streaming more with Storm's Trident API than with the raw Bolt
>>>>>>>>>> API. Even then, there are significant differences, but it's a bit
>>>>>>>>>> closer.
>>>>>>>>>>
>>>>>>>>>> If you can share your use case, we might be able to provide better
>>>>>>>>>> guidance.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Will
>>>>>>>>>>
>>>>>>>>>> On June 16, 2015, at 9:46 PM, asoni.learn@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am evaluating Spark vs. Storm (Spark Streaming), and I am not
>>>>>>>>>> able to see what the equivalent of a Bolt in Storm is inside
>>>>>>>>>> Spark.
>>>>>>>>>>
>>>>>>>>>> Any help on this will be appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ashish
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org