spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Lifecycle of RDD in spark-streaming
Date Thu, 27 Nov 2014 00:02:34 GMT
Let me further clarify Lalit's point on when RDDs generated by
DStreams are destroyed, and hopefully that will answer your original
questions.

1.  How does Spark (Streaming) guarantee that all the actions are taken on
each input RDD/batch?
This isn't hard! By the time you call streamingContext.start(), you
have already set up the output operations (foreachRDD, saveAs***Files,
etc.) that you want to do with the DStream. There are RDD actions
inside the DStream output operations that need to be done every batch
interval. So all the system does is this: after every batch
interval, it puts all the output operations (which will call RDD actions)
in a job queue, and then keeps executing whatever is in the queue. If there
is any failure in running the jobs, the streaming context will stop.
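
To make the queueing idea concrete, here is a minimal toy model in plain
Python. It is not Spark's scheduler; the class and method names
(ToyStreamingContext, register_output_op, on_batch, run_pending_jobs) are
hypothetical and exist only for illustration. It assumes one job per output
operation per batch, executed in FIFO order.

```python
from collections import deque


class ToyStreamingContext:
    """Toy model only; hypothetical names, not Spark's implementation."""

    def __init__(self):
        self.output_ops = []      # set up before start(), like foreachRDD
        self.job_queue = deque()  # jobs waiting to run
        self.stopped = False

    def register_output_op(self, fn):
        # Stands in for declaring an output operation on a DStream.
        self.output_ops.append(fn)

    def on_batch(self, batch_time, batch_data):
        # After each batch interval, enqueue one job per output operation.
        for op in self.output_ops:
            self.job_queue.append((batch_time, op, batch_data))

    def run_pending_jobs(self):
        # Keep executing whatever is in the queue, in order.
        while self.job_queue and not self.stopped:
            batch_time, op, data = self.job_queue.popleft()
            try:
                op(batch_time, data)
            except Exception:
                # Mirrors the behaviour described above: a failed job
                # stops the context.
                self.stopped = True


results = []
ctx = ToyStreamingContext()
ctx.register_output_op(lambda t, data: results.append((t, sum(data))))
ctx.register_output_op(lambda t, data: results.append((t, len(data))))
ctx.on_batch(1, [1, 2, 3])
ctx.on_batch(2, [4, 5])
ctx.run_pending_jobs()
```

Because every registered output operation is enqueued for every batch, no
batch is skipped in this model unless a job fails and the context stops.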

2.  How does Spark determine that the life-cycle of an RDD is
complete? Is there any chance that an RDD will be cleaned out of RAM
before all actions are taken on it?
Spark Streaming knows when all the processing related to batch T
has been completed. It also keeps track of how far back in time it
needs to remember previous RDDs and keep them around in the cache,
based on what DStream operations have been done. For example, if you
are using a 1-minute window, the system knows that it needs to keep
around at least the last 1 minute of data in memory. Accordingly, it
cleans up the input data (actively unpersisted) and cached RDDs
(simply dereferenced from DStream metadata, and then Spark unpersists
them as the RDD objects get garbage collected by the JVM).
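
The retention rule can be sketched the same way. The following is a toy
model in plain Python, not Spark's code: ToyDStream, add_batch, window and
clear_metadata are hypothetical names, times are plain integers in seconds,
and the 10-second batch interval is an assumption chosen for the example.

```python
class ToyDStream:
    """Toy model only: keeps per-batch data as long as a window needs it."""

    def __init__(self, window_duration):
        self.remember = window_duration  # how much history must be kept
        self.generated = {}              # batch_time -> data (stands in for RDDs)

    def add_batch(self, batch_time, data):
        self.generated[batch_time] = data

    def window(self, batch_time):
        # Data of all batches in (batch_time - window_duration, batch_time].
        start = batch_time - self.remember
        return [d for t, d in sorted(self.generated.items())
                if start < t <= batch_time]

    def clear_metadata(self, completed_batch_time):
        # Called once all processing for a batch has completed: drop
        # references to batches older than the remember duration.
        cutoff = completed_batch_time - self.remember
        old = sorted(t for t in self.generated if t <= cutoff)
        for t in old:
            del self.generated[t]
        return old


stream = ToyDStream(window_duration=60)  # 1-minute window
dropped = []
for t in range(10, 90, 10):              # assumed 10-second batches
    stream.add_batch(t, [t])
    stream.window(t)                     # the windowed computation for batch t
    dropped += stream.clear_metadata(t)  # cleanup only after processing
```

In this model a batch is dropped only after the batch that last needed it
has finished processing, so the windowed computation never sees missing data.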

TD



On Wed, Nov 26, 2014 at 10:10 AM, tian zhang
<tzhang101@yahoo.com.invalid> wrote:
> I have found this paper, which seems to answer most of the questions about life
> duration.
> https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
>
> Tian
>
>
> On Tuesday, November 25, 2014 4:02 AM, Mukesh Jha <me.mukesh.jha@gmail.com>
> wrote:
>
>
> Hey Experts,
>
> I wanted to understand in detail about the lifecycle of rdd(s) in a
> streaming app.
>
> From my current understanding
> - rdd gets created out of the realtime input stream.
> - Transform(s) functions are applied in a lazy fashion on the RDD to
> transform into another rdd(s).
> - Actions are taken on the final transformed rdds to get the data out of the
> system.
>
> Also rdd(s) are stored in the cluster's RAM (disk if configured so) and are
> cleaned in LRU fashion.
>
> So I have the following questions on the same.
> - How does Spark (Streaming) guarantee that all the actions are taken on each
> input RDD/batch?
> - How does Spark determine that the life-cycle of an RDD is complete? Is
> there any chance that an RDD will be cleaned out of RAM before all actions
> are taken on it?
>
> Thanks in advance for all your help. Also, I'm relatively new to scala &
> spark so pardon me in case these are naive questions/assumptions.
>
> --
> Thanks & Regards,
> Mukesh Jha
>
>
