spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Why does sortByKey launch cluster job?
Date Fri, 10 Jan 2014 06:55:39 GMT
I filed it and submitted the PR that Josh suggested:

https://spark-project.atlassian.net/browse/SPARK-1021
https://github.com/apache/incubator-spark/pull/379


On Wed, Jan 8, 2014 at 9:56 AM, Andrew Ash <andrew@andrewash.com> wrote:

> And at the moment we should use the atlassian.net Jira instance, not the
> apache.org one?  The apache one looks empty.
>
> https://spark-project.atlassian.net/browse/SPARK
> https://issues.apache.org/jira/browse/SPARK
>
>
> On Wed, Jan 8, 2014 at 9:04 AM, Aaron Davidson <ilikerps@gmail.com> wrote:
>
>> Feel free to always file official bugs in Jira, as long as it's not
>> already there!
>>
>>
>> On Tue, Jan 7, 2014 at 9:47 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>
>>> Hi Josh,
>>>
>>> I just ran into this again myself and noticed that the source hasn't
>>> changed since we discussed it in December.  Should I file an official bug
>>> in Jira?
>>>
>>> Andrew
>>>
>>>
>>> On Tue, Dec 10, 2013 at 8:34 AM, Josh Rosen <rosenville@gmail.com> wrote:
>>>
>>>> I wonder whether making RangePartitioner.rangeBounds into a lazy val
>>>> would fix this (
>>>> https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>>>>  We'd need to make sure that rangeBounds() is never called before an action
>>>> is performed.  This could be tricky because it's called in the
>>>> RangePartitioner.equals() method.  Maybe it's sufficient to just compare
>>>> the number of partitions, the ids of the RDDs used to create the
>>>> RangePartitioner, and the sort ordering.  This still supports the case
>>>> where I range-partition one RDD and pass the same partitioner to a
>>>> different RDD.  It breaks support for the case where two range partitioners
>>>> created on different RDDs happen to have the same rangeBounds(), but this
>>>> seems unlikely to really harm performance, since range partitioners are
>>>> rarely equal by chance.
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <ryan@tracevector.com> wrote:
>>>>
>>>>> Thanks for the responses!  I agree that b seems like it would be
>>>>> better.  I could imagine optimizations that could be made if a filter
>>>>> call came after the sortByKey that would make the initial partitioning
>>>>> sub-optimal.  Plus this way, it's a pain to use in the REPL.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>>
>>>>>> Since sortByKey() invokes those right now, we should either a) change
>>>>>> the documentation to note that it kicks off actions or b) change the
>>>>>> method to execute those things lazily.
>>>>>>
>>>>>> Personally I'd prefer b but don't know how difficult that would be.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman <
>>>>>> jslenderman@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ryan,
>>>>>>>
>>>>>>> The *sortByKey* method creates a *RangePartitioner* (see
>>>>>>> Partitioner.scala), and the initialization code of the
>>>>>>> *RangePartitioner* invokes actions *count* and *sample*.
>>>>>>>
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <ryan@tracevector.com> wrote:
>>>>>>>
>>>>>>>> sortByKey is listed as a data transformation, not an action, yet it
>>>>>>>> launches a job.  This doesn't seem to square with the documentation.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
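
The proposal in the thread above (deferring the computation of rangeBounds
with a lazy val, and having equals() avoid touching the bounds) can be
illustrated with a minimal, self-contained Scala sketch. It does not use Spark
and is not the code from the linked pull request: FakeDataset,
EagerRangePartitioner, LazyRangePartitioner, Bounds and jobsLaunched are
hypothetical names invented for illustration, a full sort stands in for the
count and sample actions, and the sort ordering is left out of equals()
because the keys are plain Ints.

```scala
// Hypothetical stand-in for an RDD. It counts how many "jobs" it has run.
final class FakeDataset(val id: Int, data: Seq[Int]) {
  var jobsLaunched = 0

  // Stand-in for an action such as count() or sample(): doing real work
  // on the data is what we want to avoid at construction time.
  def collectSorted(): Seq[Int] = {
    jobsLaunched += 1
    data.sorted
  }
}

object Bounds {
  // Picks (partitions - 1) split points from the sorted data.
  // Assumption: a full sort replaces the sampling a real partitioner uses.
  def compute(partitions: Int, ds: FakeDataset): Seq[Int] = {
    val sorted = ds.collectSorted()
    if (partitions <= 1 || sorted.isEmpty) Seq.empty
    else (1 until partitions).map(i => sorted(i * sorted.length / partitions))
  }
}

// Eager variant: the bounds are computed in the constructor, so merely
// creating the partitioner launches a job.
final class EagerRangePartitioner(val partitions: Int, ds: FakeDataset) {
  val rangeBounds: Seq[Int] = Bounds.compute(partitions, ds)
}

// Lazy variant: the bounds are computed on first use and then cached.
final class LazyRangePartitioner(val partitions: Int, val ds: FakeDataset) {
  lazy val rangeBounds: Seq[Int] = Bounds.compute(partitions, ds)

  // Index of the partition for a key: the number of bounds below the key.
  // This is the first point at which the bounds are actually needed.
  def getPartition(key: Int): Int = rangeBounds.count(_ < key)

  // Compares only the partition count and the source dataset id, so that
  // equality checks never force rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case that: LazyRangePartitioner =>
      partitions == that.partitions && ds.id == that.ds.id
    case _ => false
  }

  override def hashCode: Int = 31 * partitions + ds.id
}
```

Under these assumptions, constructing an EagerRangePartitioner increments
jobsLaunched immediately, while constructing a LazyRangePartitioner or
comparing two of them leaves the counter at 0; it only moves on the first
getPartition() call, and later calls reuse the cached bounds.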
