spark-user mailing list archives

From Sheel Pancholi <sheelst...@gmail.com>
Subject Re: Are Spark Dataframes mutable in Structured Streaming?
Date Thu, 16 May 2019 09:14:00 GMT
Hello Russell,

Thanks for clarifying. I went over the Catalyst Optimizer Deep Dive video
at https://www.youtube.com/watch?v=RmUn5vHlevc and that, along with your
explanation, made me realize that the DataFrame is the new DStream in
Structured Streaming. If my understanding is correct, could you please
clarify the 2 points below:

1. Incremental Query - Say at time instant T1 you have 10 items to process,
and then at time instant T2 you have 5 newer items streaming in to be
processed. *Structured Streaming* says that the DF is treated as an
unbounded table and hence 15 items will be processed together. Does this
mean that on Iteration 1 (i.e. time instant T1) the Catalyst Optimizer in
the code generation phase creates an RDD of 10 elements, and on Iteration 2
(i.e. time instant T2) the Catalyst Optimizer creates an RDD of 15
elements?
2. You mentioned *"Some parts of the plan refer to static pieces of data
..."*  Could you elaborate a bit more on what these static pieces of data
refer to? Are you referring to the 10 records that had already arrived
at T1 and are now sitting as old static data in the unbounded table?
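
One way to picture the alternative to "an RDD of 15 elements" in point 1 is
the toy model below. It is plain Python, not Spark code: CountByKeyQuery and
run_micro_batch are hypothetical names invented for this sketch, and the
running count is just one example of a query. The assumption it illustrates
is that each trigger reads only the newly arrived rows and merges them into
state carried over from the previous trigger, while the object describing
the query never changes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Tuple[str]


@dataclass(frozen=True)
class CountByKeyQuery:
    """Hypothetical stand-in for a DataFrame: an immutable description
    of a computation (count rows per key), holding no data itself."""
    key_index: int = 0


def run_micro_batch(query: CountByKeyQuery,
                    state: Dict[str, int],
                    new_rows: List[Row]) -> Dict[str, int]:
    """Apply the same query to only the rows that arrived since the last
    trigger, and return a new state without mutating the old one."""
    updated = dict(state)
    for row in new_rows:
        key = row[query.key_index]
        updated[key] = updated.get(key, 0) + 1
    return updated


query = CountByKeyQuery()                 # defined once, never modified

batch_t1 = [("a",)] * 6 + [("b",)] * 4    # the 10 items present at T1
batch_t2 = [("a",)] * 2 + [("c",)] * 3    # the 5 newer items at T2

state_t1 = run_micro_batch(query, {}, batch_t1)        # touches 10 rows
state_t2 = run_micro_batch(query, state_t1, batch_t2)  # touches only 5 rows

print(state_t2)  # {'a': 8, 'b': 4, 'c': 3}
```

The result at T2 covers all 15 items, as if the whole unbounded table had
been queried, even though the second run only read the 5 new rows.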

Regards
Sheel


On Thu, May 16, 2019 at 3:30 AM Russell Spitzer <russell.spitzer@gmail.com>
wrote:

> DataFrames describe the calculation to be done, but the underlying
> implementation is an "Incremental Query". That is, the DataFrame code
> is executed repeatedly, with Catalyst adjusting the final execution plan on
> each run. Some parts of the plan refer to static pieces of data; others
> refer to data which is pulled in on each iteration. None of this changes
> the DataFrame objects themselves.
>
> On Wed, May 15, 2019 at 1:34 PM Sheel Pancholi <sheelstera@gmail.com>
> wrote:
>
>> Hi
>> Structured Streaming treats a stream as an unbounded table in the form of
>> a DataFrame. Data flowing in from the stream keeps getting added to this
>> DataFrame (the unbounded table), which implies a change to the DataFrame.
>> That violates the very basic nature of a DataFrame, since a DataFrame is
>> immutable by nature. This sounds contradictory. Is there an explanation
>> for this?
>>
>> Regards
>> Sheel
>>
>

