spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Are all transformations lazy?
Date Wed, 12 Mar 2014 05:28:08 GMT
The only point where some *actual* computation happens is when data is
requested by driver (using collect()) or materialized in external storage
(ex: saveashadoopfile). Rest of the time operations are merely stored &
saved. Once you actually ask for the data, the operations are compiled into
a DAG of stages. Each stage can contain multiple tasks (like 2 filter
operations can be combined into one stage) & executed. Hence the operations
are all lazy by default.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Tue, Mar 11, 2014 at 10:15 PM, David Thomas <dt5434884@gmail.com> wrote:

> I think you misunderstood my question - I should have stated it better.
> I'm not saying it should be applied immediately, but I'm trying to
> understand how Spark achieves this lazy computation transformations. May be
> this is due to my ignorance of how Scala works, but when I see the code, I
> see that the function is applied to the elements of RDD when I call
> distinct - or is it not applied immediately? How does the returned RDD
> 'keep track of the operation'?
>
>
> On Tue, Mar 11, 2014 at 10:06 PM, Ewen Cheslack-Postava <me@ewencp.org>wrote:
>
>> You should probably be asking the opposite question: why do you think it
>> *should* be applied immediately? Since the driver program hasn't requested
>> any data back (distinct generates a new RDD, it doesn't return any data),
>> there's no need to actually compute anything yet.
>>
>> As the documentation describes, if the call returns an RDD, it's
>> transforming the data and will just keep track of the operation it
>> eventually needs to perform. Only methods that return data back to the
>> driver should trigger any computation.
>>
>> (The one known exception is sortByKey, which really should be lazy, but
>> apparently uses an RDD.count call in its implementation:
>> https://spark-project.atlassian.net/browse/SPARK-1021).
>>
>>   David Thomas <dt5434884@gmail.com>
>>  March 11, 2014 at 9:49 PM
>> For example, is distinct() transformation lazy?
>>
>> when I see the Spark source code, distinct applies a map-> reduceByKey ->
>> map function to the RDD elements. Why is this lazy? Won't the function be
>> applied immediately to the elements of RDD when I call someRDD.distinct?
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(numPartitions: Int): RDD[T] =
>>     map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(): RDD[T] = distinct(partitions.size)
>>
>>
>

Mime
View raw message