spark-issues mailing list archives

From "Dongjoon Hyun (Jira)"
Subject [jira] [Updated] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
Date Mon, 02 Mar 2020 20:38:00 GMT


Dongjoon Hyun updated SPARK-27097:
    Affects Version/s: 2.0.0

> Avoid embedding platform-dependent offsets literally in whole-stage generated code
> ----------------------------------------------------------------------------------
>                 Key: SPARK-27097
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0
>            Reporter: Xiao Li
>            Assignee: Kris Mok
>            Priority: Critical
>              Labels: correctness
>             Fix For: 2.4.1
> Avoid embedding platform-dependent offsets literally in whole-stage generated code.
> Spark SQL performs whole-stage code generation to speed up query execution. There are
two steps to it:
> 1. Java source code is generated from the physical query plan on the driver. A single version
of the source code is generated from a query plan, and sent to all executors.
>    - It's compiled to bytecode on the driver to catch compilation errors before sending to
executors, but currently only the generated source code gets sent to the executors. The bytecode
compilation is for fail-fast only.
> 2. Executors receive the generated source code and compile to bytecode, then the query runs
like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors being run
on similar platforms. Some code paths accidentally embedded platform-dependent object layout
information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets a value
to that field.
> But whole-stage code generation generally uses platform-dependent information from the
driver. If the object layout is significantly different on the driver and executors, the generated
code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to this problem is the use of Platform.XXX constants in generated
code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into the generated
code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically -- Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET,
value), which will be able to pick up the correct value on the executors.
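
The difference between the Bad and Good templates can be sketched in plain Java as follows. This is only an illustration of the string substitution: the class OffsetTemplates, its method names, and the driver-side value 16 are hypothetical and are not taken from Spark's code generator.

```java
// Hypothetical illustration only; not Spark's actual code generator.
final class OffsetTemplates {
    // Assumed stand-in for whatever value the driver JVM happens to report.
    static final long DRIVER_BYTE_ARRAY_OFFSET = 16L;

    // Bad: the driver's numeric offset is baked into the generated source text.
    static String literalTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", " + DRIVER_BYTE_ARRAY_OFFSET + ", " + value + ");";
    }

    // Good: the generated source names the constant, so the JVM that compiles
    // and runs the code resolves the offset for its own platform.
    static String symbolicTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, " + value + ");";
    }

    public static void main(String[] args) {
        System.out.println(literalTemplate("buffer", "1L"));
        System.out.println(symbolicTemplate("buffer", "1L"));
    }
}
```

The first generated line carries a number that is only meaningful on the JVM that produced it; the second carries a name that is looked up wherever the generated code is compiled and run.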
> Caveat: these offset constants are declared as runtime-initialized static final fields in Java,
so they're not compile-time constants from the Java language's perspective. Referencing them
symbolically does lead to a slightly increased size of the generated code, but this is necessary
for correctness.
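
The distinction in the caveat can be observed with a small, self-contained Java sketch. The classes below are hypothetical and the value 16 is an assumption. The sketch relies on a standard Java rule: reading a compile-time constant is inlined by javac and does not trigger initialization of the declaring class, while reading a runtime-initialized static final field does.

```java
// Hypothetical classes for illustration; names and the value 16 are assumptions.
final class ConstantKinds {
    // Records which holder classes have run their static initializers.
    static final StringBuilder INIT_LOG = new StringBuilder();

    static final class Inlined {
        // Constant expression: javac copies the value into every use site.
        static final int OFFSET = 16;
        static { INIT_LOG.append("Inlined;"); }
    }

    static final class Computed {
        // Initialized by a method call during class initialization, so javac
        // cannot inline it; each use compiles to a field read.
        static final int OFFSET = Integer.parseInt("16");
        static { INIT_LOG.append("Computed;"); }
    }

    public static void main(String[] args) {
        int a = Inlined.OFFSET;   // inlined constant, Inlined is not initialized
        System.out.println(a + " -> initialized so far: [" + INIT_LOG + "]");
        int b = Computed.OFFSET;  // field read, Computed is initialized here
        System.out.println(b + " -> initialized so far: [" + INIT_LOG + "]");
    }
}
```

Only the second read touches the declaring class at run time, which is the behavior generated code needs when the value must come from the executor's JVM rather than from wherever the source was produced.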
> NOTE: there can be other patterns that generate platform-dependent code on the driver
which is invalid on the executors. For example, if the endianness differs between the driver
and the executors and some generated code makes strong assumptions about endianness, that
code would also be problematic.
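
As a rough illustration of the endianness concern, the following Java sketch contrasts an encoding that follows the native byte order with one that fixes the byte order explicitly. It uses only java.nio and is not taken from Spark; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Spark.
final class EndianSketch {
    // Platform-dependent: the byte layout follows the native order of the
    // machine this JVM runs on, so two machines may disagree.
    static byte[] encodeNative(long v) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.nativeOrder()).putLong(v).array();
    }

    // Platform-independent: the byte order is stated explicitly.
    static byte[] encodeLittleEndian(long v) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    // Decodes with the same explicit order used by encodeLittleEndian.
    static long decodeLittleEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        long v = 0x0102030405060708L;
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("round trip ok: " + (decodeLittleEndian(encodeLittleEndian(v)) == v));
    }
}
```

Code that writes with one machine's native order and reads with another's can silently disagree, whereas a fixed order round-trips regardless of where each side runs.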
