hive-issues mailing list archives

From "Matt McCline (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-19200) Vectorization: Disable vectorization for LLAP I/O when a non-VECTORIZED_INPUT_FILE_FORMAT mode is needed (i.e. rows) and data type conversion is needed
Date Fri, 13 Apr 2018 16:33:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-19200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt McCline updated HIVE-19200:
--------------------------------
    Description: 
Disable vectorization for issue in HIVE-18763 until we can do the harder VRB conversion code.

The main changes are:

1) In the Vectorizer, detect whether data type conversion is needed between the partition and the
desired table schema.  If so, and LLAP I/O is enabled in encoded caching mode, then do
not vectorize.  Why? When LLAP I/O is in encoded caching mode, it delivers VectorizedRowBatch
(VRB) to the VectorMapOperator instead of (object) rows.  We currently do not have logic
for converting VRBs, so we get either wrong results or, more likely, a ClassCastException on
the expected vs. actual ColumnVector columns.

2) Cleaned up the error message logic that was suppressing the new message from the EXPLAIN VECTORIZATION
display.
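
As a rough illustration of the check described in 1), here is a minimal sketch. It is not the code from the attached patch: the class name, the method names, and the use of plain type-name strings are all assumptions made for the example, and the comparison is positional only.

```java
import java.util.List;

/** Hypothetical stand-in for the Vectorizer check; not Hive's actual code. */
public class VectorizeDecisionSketch {

    /** True when a partition column type differs from the table schema type at the same position. */
    public static boolean needsConversion(List<String> partitionTypes, List<String> tableTypes) {
        int n = Math.min(partitionTypes.size(), tableTypes.size());
        for (int i = 0; i < n; i++) {
            if (!partitionTypes.get(i).equalsIgnoreCase(tableTypes.get(i))) {
                return true;
            }
        }
        return false;
    }

    /**
     * Refuse to vectorize only when both conditions hold: LLAP I/O is in
     * encoded caching mode (so VRBs, not rows, would be delivered) and
     * some column needs a data type conversion.
     */
    public static boolean canVectorize(boolean llapEncodedCaching,
                                       List<String> partitionTypes,
                                       List<String> tableTypes) {
        return !(llapEncodedCaching && needsConversion(partitionTypes, tableTypes));
    }
}
```

In this sketch, a partition column of type int read against a table column of type bigint blocks vectorization only when encoded caching is on; with matching types, or with encoded caching off, the plan can still be vectorized.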

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------

The longer-term solution can be done later in steps:

1) Write new code that can take a VectorizedRowBatch (VRB) and convert columns to different
data types.  This is needed when LLAP is doing its encoding / caching and feeds VRBs to VectorMapOperator
instead of rows.  Similar to what MapOperator does today, VectorMapOperator would need to
be enhanced to convert partition VRBs into the table schema VRBs that the vector operator
tree expects.

2) Today, vectorization logic is strictly position-based.  It insists that the partition
columns have the same names as the table schema.  The MapOperator (and ORC) do more general
conversion that uses column names instead of column positions.  We'd need to enhance all 3
classes to handle column-name-based conversion.  The 3 classes are: the new VRB-to-VRB conversion
class, VectorDeserializeRow, and VectorAssignRow.
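
To make the two longer-term steps concrete, here is a small sketch under stated assumptions. Plain Java arrays stand in for Hive's ColumnVector classes, and the class and method names are hypothetical; the real VRB-to-VRB conversion class does not exist yet and would have to handle nulls, repeating values, selected rows, and many more type pairs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; plain arrays stand in for Hive's ColumnVector classes. */
public class ColumnConversionSketch {

    /** Step 1 idea: convert one column to another data type, here long to double. */
    public static double[] longToDouble(long[] column, int size) {
        double[] out = new double[size];
        for (int i = 0; i < size; i++) {
            out[i] = (double) column[i];
        }
        return out;
    }

    /**
     * Step 2 idea: map by column name instead of position. For each table
     * column, return the index of the partition column with the same name
     * (case-insensitive), or -1 when the partition has no such column.
     */
    public static int[] mapByName(List<String> partitionColumns, List<String> tableColumns) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < partitionColumns.size(); i++) {
            byName.put(partitionColumns.get(i).toLowerCase(Locale.ROOT), i);
        }
        int[] mapping = new int[tableColumns.size()];
        for (int t = 0; t < tableColumns.size(); t++) {
            String key = tableColumns.get(t).toLowerCase(Locale.ROOT);
            mapping[t] = byName.getOrDefault(key, -1);
        }
        return mapping;
    }
}
```

For example, with partition columns (b, a) and table columns (a, b, c), mapByName returns [1, 0, -1]: table column a comes from partition position 1, b from position 0, and c is missing from the partition.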

  was:
Disable vectorization for issue in HIVE-18763 until we can do the harder VRB conversion code.

The longer-term solution can be done later in steps:

1) Write new code that can take a VectorizedRowBatch (VRB) and convert columns to different
data types.  This is needed when LLAP is doing its encoding / caching and feeds VRBs to VectorMapOperator
instead of rows.  Similar to what MapOperator does today, VectorMapOperator would need to
be enhanced to convert partition VRBs into the table schema VRBs that the vector operator
tree expects.

2) Today, vectorization logic is strictly position-based.  It insists that the partition
columns have the same names as the table schema.  The MapOperator (and ORC) do more general
conversion that uses column names instead of column positions.  We'd need to enhance all 3
classes to handle column-name-based conversion.  The 3 classes are: the new VRB-to-VRB conversion
class, VectorDeserializeRow, and VectorAssignRow.


> Vectorization: Disable vectorization for LLAP I/O when a non-VECTORIZED_INPUT_FILE_FORMAT
mode is needed (i.e. rows) and data type conversion is needed
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-19200
>                 URL: https://issues.apache.org/jira/browse/HIVE-19200
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.0.0
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HIVE-19200.01.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
