flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabor Gevay (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (FLINK-4867) Investigate code generation for improving sort performance
Date Tue, 05 Sep 2017 11:53:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Gabor Gevay resolved FLINK-4867.
    Resolution: Done

I'm closing this, as [~heytitle] did the performance investigation, and concluded that code
generation is worthwhile to implement.

The Jira for actually implementing this is here:

> Investigate code generation for improving sort performance
> ----------------------------------------------------------
>                 Key: FLINK-4867
>                 URL: https://issues.apache.org/jira/browse/FLINK-4867
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Local Runtime
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Minor
>              Labels: performance
>         Attachments: Evaluation of Code Generation Approach for improving  Flinkā€™s
> This issue is for investigating whether code generation could speed up sorting. We should
make some performance measurements on hand-written code that is similar to what we could generate,
to see whether investing more time into this is worth it. If we find that it is worth it,
we can open a second Jira for the actual implementation of the code generation.
> I think we could generate one class at places where we currently instantiate {{QuickSort}}.
This generated class would include the functionality of {{QuickSort}}, {{NormalizedKeySorter}}
or {{FixedLengthRecordSorter}}, {{MemorySegment.compare}}, and {{MemorySegment.swap}}.
> Btw. I'm planning to give this as a student project at a TU Berlin course in the next
few months.
> Some concrete ideas about how could a generated sorter be faster than the current sorting
> * {{MemorySegment.compare}} could be specialized for
> ** Length: for small records, the loop could be unrolled
> ** Endiannes (currently it is optimized for big endian; and in the little endian case
(e.g. x86) it does a Long.reverseBytes for each long read)
> * {{MemorySegment.swapBytes}}
> ** In case of small records, using three {{UNSAFE.copyMemory}} is probably not as efficient
as a specialized swap, because
> *** We could use total loop unrolling in generated code (because we know the exact record
> *** {{UNSAFE.copyMemory}} checks for alignment first \[6,9\]
> *** We will only need 2/3 the memory bandwidth, because the temporary storage could be
a register if we swap one byte (or one {{long}}) at a time
> ** several checks might be eliminated
> * Better inlining behaviour could be achieved 
> ** Virtual function calls to the methods of {{InMemorySorter}} could be eliminated. (Note,
that these are problematic to devirtualize by the JVM if there are different derived classes
used in a single Flink job (see \[8,7\]).)
> ** {{MemorySegment.swapBytes}} is probably not inlined currently, because the excessive
checks make it too large
> ** {{MemorySegment.compare}} is probably also not inlined currently, because those two
while loops are too large
> \[6\] http://www.docjar.com/docs/api/sun/misc/Unsafe.html#copyMemory(Object, long, Object,
long, long)
> \[7\] https://shipilev.net/blog/2015/black-magic-method-dispatch/
> \[8\] http://insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/
> \[9\] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/cpu/x86/vm/stubGenerator_x86_64.cpp#l2409

This message was sent by Atlassian JIRA

View raw message