spark-user mailing list archives

From Massimiliano Tomassi <max.toma...@gmail.com>
Subject Re: Dstream Transformations
Date Mon, 06 Oct 2014 09:10:56 GMT
From the Spark Streaming Programming Guide (
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node
):

*...output operations (like foreachRDD) have at-least once semantics, that
is, the transformed data may get written to an external entity more than
once in the event of a worker failure.*

I think that when a worker fails, the entire graph of
transformations/actions is reapplied to that RDD. This means that, in your
case, both store operations will be executed again. For this reason, a talk
I watched on YouTube suggested making all output operations idempotent.
Unfortunately that isn't always possible: e.g. when you are building an
analytics system and need to increment counters.

That's my understanding so far; does anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagirdar@philips.com>
wrote:

>  Given that I have multiple worker nodes, when Spark reschedules the job
> on the worker nodes that are still alive, does it then store the data in
> Elasticsearch again and then in Flume, or does it only run the function
> that stores in Flume?
>
>  Regards,
> Madhu Jahagirdar
>
>  ------------------------------
> *From:* Akhil Das [akhil@sigmoidanalytics.com]
> *Sent:* Monday, October 06, 2014 1:20 PM
> *To:* Jahagirdar, Madhu
> *Cc:* user
> *Subject:* Re: Dstream Transformations
>
>    AFAIK Spark doesn't restart worker nodes itself. You can have multiple
> worker nodes, and in that case, if one worker node goes down, Spark will
> try to recompute the lost RDDs on the workers that are still alive.
>
>  Thanks
> Best Regards
>
> On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <
> madhu.jahagirdar@philips.com> wrote:
>
>> In my Spark Streaming program I use the Kafka utils to receive data and
>> store it in Elasticsearch and in Flume. Both storing functions are applied
>> to the same DStream. My question: what is the behavior of Spark if, after
>> storing data in Elasticsearch, the worker node dies before storing it in
>> Flume? Does it restart the worker and then store the data in Elasticsearch
>> again and then in Flume, or does it only run the function that stores in
>> Flume?
>>
>> Regards
>> Madhu Jahagirdar
>>
>>
>>
>


-- 
------------------------------------------------
Massimiliano Tomassi
------------------------------------------------
web: http://about.me/maxtomassi
e-mail: max.tomassi@gmail.com
------------------------------------------------
