spark-user mailing list archives

From Mahdi Namazifar <mahdi.namazi...@gmail.com>
Subject Re: A chain of lazy operations starts running tasks
Date Tue, 24 Sep 2013 03:31:50 GMT
Thanks for your response!
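
For anyone who finds this thread later, here is a minimal sketch that makes
the behavior visible in the shell.  The HDFS path is hypothetical and the
comments are my own reading of the docs, so treat it as an illustration
rather than a reference:

// None of these launch tasks when defined...
val a      = sc.textFile("hdfs:///tmp/words.txt", 1000)    // lazy
val words  = a.flatMap(line => line.split("\t"))           // lazy
val pairs  = words.map(word => (word, 1))                  // lazy
val counts = pairs.reduceByKey((x, y) => x + y, 100)       // lazy
// ...but this one does: defining sortByKey runs a sampling job
// (the tasks that show up in the UI) before any action is called.
val sorted = counts.sortByKey(false, 500)
sorted.count()  // the explicit action that runs the full pipeline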


On Mon, Sep 23, 2013 at 7:42 PM, Reynold Xin <rxin@cs.berkeley.edu> wrote:

> The reason is that sortByKey triggers a sample operation to determine the
> range partitioner.
>
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
>
> On Mon, Sep 23, 2013 at 5:47 PM, Mahdi Namazifar <
> mahdi.namazifar@gmail.com> wrote:
>
>> Hi,
>>
>> I think I might be missing something, but here is what I observe, which is
>> inconsistent with my understanding of transformations vs. actions: in the
>> Spark shell I do the following
>>
>> val a = sc.textFile("[my file]", 1000)
>> val c = a.flatMap(line => line.split("\t"))
>>          .map(word => (word, 1))
>>          .reduceByKey((a, b) => a + b, 100)
>>          .sortByKey(false, 500)
>>
>> This is for experimentation purposes only: I'm running a word count on a
>> file read from HDFS and then sorting the result by word.
>>
>> My understanding from the documentation is that flatMap, map, reduceByKey,
>> and sortByKey are all transformations and are therefore lazy.  But when I
>> run the second command, I see 1000 ShuffleMapTasks, followed by 100
>> ResultTasks and another 100 ResultTasks, running on the cluster, which
>> together take about 400 seconds.  Am I missing something?  Could someone
>> kindly explain what exactly happens when I run the second command?  I was
>> expecting it to only create an RDD and not run any tasks.
>>
>> BTW, I'm using Spark 0.7.2 on a 1+4-node cluster.
>>
>> Thanks,
>> Mahdi
>>
>
>
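
To make the answer above concrete: sortByKey uses a RangePartitioner, and to
choose partition boundaries the partitioner has to look at the actual key
distribution, which it does by sampling the RDD eagerly.  Below is a
simplified sketch of that idea only, not Spark's real RangePartitioner code
(in 0.7.x the RDD class lives under the plain spark package; the modern
package name is used here):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Pick (partitions - 1) boundary keys by sampling the RDD.  The collect()
// is an action, so this work runs the moment sortByKey is "defined".
def pickRangeBounds[K : Ordering : ClassTag](rdd: RDD[(K, _)],
                                             partitions: Int): Array[K] = {
  val keys = rdd.sample(false, 0.01, 42L)  // sample() itself is lazy...
                .map(_._1)
                .collect()                 // ...but this launches a job
                .sorted
  require(keys.nonEmpty, "cannot pick bounds from an empty sample")
  // roughly evenly spaced sampled keys become the boundaries that the
  // range partitioner later uses to route each record
  (1 until partitions).map { i =>
    keys(math.min(keys.length - 1, i * keys.length / partitions))
  }.toArray
}

If I remember the 0.7-era code correctly, the real partitioner also does a
count() before sampling, which would line up with the two separate waves of
100 ResultTasks reported above.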
