asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chen Luo <>
Subject Re: Adapting TimSort into AsterixDB/Hyracks
Date Sat, 28 Oct 2017 21:26:12 GMT
Another sort performance issue to notice is the "normalized key".  When we
sort frames of tuples, we build a directory of pointers to avoid memory
movements. Each record pointer has four integers, i.e., *frame Id, tuple
start, tuple end, normalized key*. Normalized key is critical to speed up
comparisons by avoiding random memory accesses. On 1GB random records,
using normalized key can improve sort performance by *3-4x*. However, this
normalized key idea is under utilized in AsterixDB layer:

   - The current length of normalized key is only 4 bytes, which limits its
   applicability of longer (e.g., time/long/double) but fixed length types.
   - Normalized key only skips memory accesses when two keys are different.
   However, when records have a lot of duplicate keys, e.g., order on some
   stateID or countyID, current mechanism won't help too much.
   - Only a limited number of types have implemented this normalized key
   computer (see [1]). For example, *UUID does not support it*, which makes
   it much slower to sort UUID keys. Ideally, every sortable type should have
   a corresponding mechanism to compute normalized key.

These issues need to be resolved (maybe later) to improve sort performance.
Otherwise, even the best cache-friendly sorting algorithm won't help too
much, since each comparison would incur multiple cache misses.


On Sat, Oct 28, 2017 at 1:50 PM, Chen Luo <> wrote:

> That can be done in Hyracks layer, but enabling Quicksort in AsterixDB has
> other issues. Right now AsterixDB only uses merge sort because of its
> stableness. Once switching to Quicksort, we have to make sure all the other
> operators are happy with this change (some operators are OK with it, but
> some are not).
> On Sat, Oct 28, 2017 at 12:05 PM, Mike Carey <> wrote:
>> So the word on the web seems maybe to be that Quicksort is generally
>> superior and cache-friendly (cache oblivious). Wondering if we should just
>> get our Quicksort code under control?
>> On 10/28/17 11:36 AM, Chen Luo wrote:
>>> Not exactly sure about this right now... But since TimSort is
>>> essentially a
>>> combination of insertion sort and merge sort, it's cache-friendliness
>>> won't
>>> be worse than our merge sort.
>>> This TimSort could be served as a short-term plugin-and-play improvement
>>> of
>>> our sorting algorithm. It is still stable, the same as our current merge
>>> sort, but faster, especially on partially ordered dataset.
>>> Best regards,
>>> Chen Luo
>>> On Sat, Oct 28, 2017 at 10:58 AM, Mike Carey <> wrote:
>>> How is it on cache-friendliness?
>>>> On 10/27/17 11:38 PM, abdullah alamoudi wrote:
>>>> While I have no answer to the question of legality, this sounds great.
>>>>> ~Abdullah.
>>>>> On Oct 27, 2017, at 9:20 PM, Chen Luo <> wrote:
>>>>>> Hi devs,
>>>>>> I have adapted the TimSort algorithm used in JDK (java.util.TimSort)
>>>>>> into
>>>>>> Hyracks, which gives 10-20% performance improvements on random data.
>>>>>> It
>>>>>> will be more useful if the input data is partially sorted, e.g.,
>>>>>> primary
>>>>>> keys fetched from secondary index scan, which I haven't got time
>>>>>> experiment with.
>>>>>> *Before going any further, is it legal to adapt some algorithm
>>>>>> implementation from JDK into our codebase? *I saw the JDK
>>>>>> implementation
>>>>>> itself is adopted from
>>>>>> well.
>>>>>> Best regards,
>>>>>> Chen Luo

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message