spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Davidson <ilike...@gmail.com>
Subject Re: almost sorted data
Date Fri, 25 Oct 2013 22:20:11 GMT
Currently, our sortByKey should be using Java's native Timsort
implementation, which is an adaptive sort. That should also mean sorting is
very fast for almost-sorted data. The overhead you're seeing might be
caused by reshuffling everything during the range partitioning step *before
*the sort, which has to serialize all your data.

Nathan's solution might then work out nicely for you, as it will avoid
shuffling the data.


On Fri, Oct 25, 2013 at 9:18 AM, Josh Rosen <rosenville@gmail.com> wrote:

> Adaptive sorting algorithms (https://en.wikipedia.org/wiki/Adaptive_sort)
> can benefit from presortedness in their inputs, so that might be a
> helpful search term for researching this problem.
>
>
> On Fri, Oct 25, 2013 at 7:23 AM, Nathan Kronenfeld <
> nkronenfeld@oculusinfo.com> wrote:
>
>> I suspect from his description the difference is negligible for his case.
>>  However, there are ways around that anyway.
>>
>> Assuming a fixed data set (as opposed to something like a streaming
>> example, where there is no last element), one can take 3 passes to:
>>
>>    1. get the last element of each partition
>>    2. take elements from each partition that fall before the last
>>    element of the previous partition, separate them from the rest of their
>>    partition
>>    3. and add them to the previous (whichever previous is appropriate,
>>    in really degenerate cases, which it sounds like he doesn't have) in the
>>    right location
>>
>>
>>
>>
>> On Fri, Oct 25, 2013 at 10:17 AM, Sebastian Schelter <ssc@apache.org>wrote:
>>
>>> Using a local sort per partition only gives a correct result if the data
>>> is already range partitioned.
>>>
>>> On 25.10.2013 16:11, Nathan Kronenfeld wrote:
>>> > Since no one else has answered...
>>> > I assume:
>>> >
>>> >     data.mapPartitions(_.toList.sortBy(...).toIterator)
>>> >
>>> > would work, but I also suspect there's a better way.
>>> >
>>> >
>>> > On Fri, Oct 25, 2013 at 5:01 AM, Arun Kumar <arunpatala@gmail.com>
>>> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> I am trying to process some logs and the data is sorted(*almost*) by
>>> >> timestamp.
>>> >> If I do a full sort it takes a lot of time. Is there some way to sort
>>> more
>>> >> efficiently (like restricting sort to per partition).
>>> >>
>>> >> Thanks in advance
>>> >>
>>> >
>>> >
>>> >
>>>
>>>
>>
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone:  +1-416-203-3003 x 238
>> Email:  nkronenfeld@oculusinfo.com
>>
>
>

Mime
View raw message