From: Cody Koeninger
To: bit1129@163.com
Cc: prajod.vettiyattil@wipro.com, jrpilat@gmail.com, eshioji@gmail.com,
    wrbriggs@gmail.com, asoni.learn@gmail.com, ayan guha <guha.ayan@gmail.com>,
    user <user@spark.apache.org>, sateesh.kavuri@gmail.com,
    sparkenthusiast@yahoo.in, sabarish.sasidharan@manthan.com
Date: Thu, 18 Jun 2015 09:07:53 -0500
Subject: Re: RE: Spark or Storm
That general description is accurate, but not really a specific issue of
the direct stream. It applies to anything consuming from Kafka (or, as
Matei already said, any streaming system really). You can't have
exactly-once semantics unless you know something more about how you're
storing results.

For "some unique id", topic/partition and offset is usually the obvious
choice, which is why it's important that the direct stream gives you
access to the offsets.

See https://github.com/koeninger/kafka-exactly-once for more info
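To make the idempotent-write idea concrete, here is a minimal sketch
against the Spark 1.3 direct stream API. The upsert helper is a
hypothetical stand-in for whatever idempotent store you actually use
(e.g. a database UPSERT keyed on topic, partition, and offset):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object ExactlyOnceSketch {
  // Hypothetical idempotent sink, e.g. an UPSERT keyed on (topic, partition, offset).
  def upsert(topic: String, partition: Int, offset: Long, value: String): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("exactly-once-sketch"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // The direct stream exposes the exact Kafka offsets behind each batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        // Partitions of a direct stream RDD map 1:1 to Kafka topic/partitions.
        val osr = offsetRanges(TaskContext.get.partitionId)
        iter.zipWithIndex.foreach { case ((_, value), i) =>
          // Keying by (topic, partition, offset) makes replays overwrite, not duplicate.
          upsert(osr.topic, osr.partition, osr.fromOffset + i, value)
        }
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Because a replayed batch carries exactly the same offsets, re-running it
overwrites the same rows rather than producing duplicates.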
On Thu, Jun 18, 2015 at 6:47 AM, bit1129@163.com wrote:

> I am wondering how the direct stream API ensures end-to-end exactly-once
> semantics.
>
> I think there are two things involved:
> 1. On the Spark Streaming end, the driver will replay the offset range
> when it is restarted after going down, which means that the new tasks
> will process some already-processed data.
> 2. On the user end, since tasks may process already-processed data, the
> user end should detect that some data has already been processed, e.g.
> using some unique ID.
>
> Not sure if I have understood correctly.
>
> ------------------------------
> bit1129@163.com
>
> *From:* prajod.vettiyattil@wipro.com
> *Date:* 2015-06-18 16:56
> *To:* jrpilat@gmail.com; eshioji@gmail.com
> *CC:* wrbriggs@gmail.com; asoni.learn@gmail.com; guha.ayan@gmail.com;
> user@spark.apache.org; sateesh.kavuri@gmail.com; sparkenthusiast@yahoo.in;
> sabarish.sasidharan@manthan.com
> *Subject:* RE: Spark or Storm
>
> >> not being able to read from Kafka using multiple nodes
>
> > Kafka is plenty capable of doing this.
>
> I faced the same issue before Spark 1.3 was released.
>
> The issue was not with Kafka, but with Spark Streaming's Kafka connector.
> Before the Spark 1.3.0 release, one Spark worker would get all the
> streamed messages. We had to re-partition to distribute the processing.
>
> From the Spark 1.3.0 release, the Spark Direct API for Kafka supports
> parallel reads from Kafka streamed to Spark workers. See "Approach 2:
> Direct Approach" on this page:
> http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html. Note
> that it also mentions zero data loss and exactly-once semantics for Kafka
> integration.
>
> Prajod
>
> *From:* Jordan Pilat [mailto:jrpilat@gmail.com]
> *Sent:* 18 June 2015 03:57
> *To:* Enno Shioji
> *Cc:* Will Briggs; asoni.learn@gmail.com; ayan guha; user; Sateesh
> Kavuri; Spark Enthusiast; Sabarish Sasidharan
> *Subject:* Re: Spark or Storm
>
> > not being able to read from Kafka using multiple nodes
>
> Kafka is plenty capable of doing this, by clustering multiple consumer
> instances together into a consumer group. If your topic is sufficiently
> partitioned, the consumer group can consume the topic in parallel.
> If it isn't, you still have the fault tolerance associated with
> clustering the consumers.
>
> OK
> JRP
>
> On Jun 17, 2015 1:27 AM, "Enno Shioji" wrote:
>
> We've evaluated Spark Streaming vs. Storm and ended up sticking with
> Storm.
>
> Some of the important drawbacks are:
>
> Spark has no back pressure (the receiver rate limit can alleviate this
> to a certain point, but it's far from ideal).
>
> There are also no exactly-once semantics. (updateStateByKey can achieve
> these semantics, but it is not practical if you have any significant
> amount of state, because it does so by dumping the entire state on every
> checkpoint.)
>
> There are also some minor drawbacks that I'm sure will be fixed quickly,
> like no task timeout, not being able to read from Kafka using multiple
> nodes, and a data-loss hazard with Kafka.
>
> It's also not possible to attain very low latency in Spark, if that's
> what you need.
>
> The plus for Spark is the concise and, IMO, more intuitive syntax,
> especially if you compare it with Storm's Java API.
>
> I admit I might be a bit biased towards Storm, though, as I'm more
> familiar with it.
>
> Also, you can do some processing with Kinesis. If all you need to do is
> a straightforward transformation and you are reading from Kinesis to
> begin with, it might be easier to just do the transformation in Kinesis.
>
> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan <
> sabarish.sasidharan@manthan.com> wrote:
>
> Whatever you write in bolts would be the logic you want to apply to your
> events. In Spark, that logic would be coded in map() or similar
> transformations and/or actions. Spark doesn't enforce a structure for
> capturing your processing logic the way Storm does.
>
> Regards
> Sab
>
> Probably overloading the question a bit.
>
> In Storm, Bolts have the functionality of getting triggered on events.
> Is that kind of functionality possible with Spark Streaming? During each
> phase of the data processing, the transformed data is stored to the
> database, and this transformed data should then be sent to a new
> pipeline for further processing.
>
> How can this be achieved using Spark?
>
> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast <
> sparkenthusiast@yahoo.in> wrote:
>
> I have a use case where a stream of incoming events has to be aggregated
> and joined to create complex events. The aggregation will have to happen
> at an interval of 1 minute (or less).
>
> The pipeline is:
>
> Upstream services --(send events)--> KAFKA --(enrich event)-->
> event Stream Processor ------> Complex Event Processor ------> Elastic Search.
>
> From what I understand, Storm will make a very good ESP and Spark
> Streaming will make a good CEP.
>
> But we are also evaluating Storm with Trident.
>
> How does Spark Streaming compare with Storm with Trident?
>
> Sridhar Chellappa
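For the 1-minute aggregation and join described above, here is a minimal
sketch of the shape this takes in Spark Streaming's windowing API. The
sources, field layout, and sink are hypothetical stand-ins (in the real
pipeline they would be Kafka streams and an Elasticsearch writer):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedCepSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("windowed-cep-sketch"), Seconds(10))

    // Hypothetical keyed event streams; real sources would be Kafka.
    val orders = ssc.socketTextStream("localhost", 9998)
      .map { line => val Array(id, amt) = line.split(","); (id, amt.toDouble) }
    val clicks = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, page) = line.split(","); (id, page) }

    // Aggregate order amounts per key over a sliding 1-minute window.
    val totals = orders.reduceByKeyAndWindow(_ + _, Minutes(1), Seconds(10))

    // Join the aggregate against the click stream, windowed the same way,
    // to form the "complex event".
    val complex = totals.join(clicks.window(Minutes(1), Seconds(10)))
    complex.print() // the real pipeline would write to Elasticsearch here

    ssc.start()
    ssc.awaitTermination()
  }
}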
> On Wednesday, 17 June 2015 10:02 AM, ayan guha <guha.ayan@gmail.com>
> wrote:
>
> I have a similar scenario where we need to bring data from Kinesis to
> HBase. Data velocity is 20k per 10 mins. A little manipulation of the
> data will be required, but that's regardless of the tool, so we will
> write that piece as a Java POJO.
>
> The whole environment is on AWS. HBase is on a long-running EMR cluster,
> and Kinesis is on a separate cluster.
>
> TIA.
> Best
> Ayan
>
> On 17 Jun 2015 12:13, "Will Briggs" wrote:
>
> The programming models for the two frameworks are conceptually rather
> different; I haven't worked with Storm for quite some time, but based on
> my old experience with it, I would equate Spark Streaming more with
> Storm's Trident API than with the raw Bolt API. Even then, there are
> significant differences, but it's a bit closer.
>
> If you can share your use case, we might be able to provide better
> guidance.
>
> Regards,
> Will
>
> On June 16, 2015, at 9:46 PM, asoni.learn@gmail.com wrote:
>
> Hi All,
>
> I am evaluating Spark vs. Storm (Spark Streaming), and I am not able to
> see what the equivalent of a Storm Bolt is inside Spark.
>
> Any help on this will be appreciated.
>
> Thanks,
> Ashish
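For what it's worth, a minimal sketch of the bolt-equivalent shape in
Spark Streaming, along the lines of Sabarish's answer above; the source,
the enrichment logic, and the saveToDb sink are hypothetical stand-ins:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BoltEquivalentSketch {
  // Hypothetical sink standing in for a real database write.
  def saveToDb(row: String): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("bolt-equivalent-sketch"), Seconds(1))

    // Stand-in source; a real job would read from Kafka, Kinesis, etc.
    val events = ssc.socketTextStream("localhost", 9999)

    // What a Storm bolt would do per tuple is just a transformation here.
    val enriched = events.map(_.toUpperCase) // stand-in "enrichment" logic

    // Persist each micro-batch as a side effect...
    enriched.foreachRDD(rdd => rdd.foreachPartition(_.foreach(saveToDb)))

    // ...while the same transformed stream feeds the next processing stage.
    val complexEvents = enriched.filter(_.startsWith("ORDER")) // stand-in stage 2
    complexEvents.print()

    ssc.start()
    ssc.awaitTermination()
  }
}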
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org