hive-issues mailing list archives

From "Alan Gates (JIRA)" <>
Subject [jira] [Commented] (HIVE-17361) Support LOAD DATA for transactional tables
Date Fri, 01 Dec 2017 01:13:01 GMT


Alan Gates commented on HIVE-17361:

+1 based on discussion in review board.

> Support LOAD DATA for transactional tables
> ------------------------------------------
>                 Key: HIVE-17361
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>            Reporter: Wei Zheng
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-17361.07.patch, HIVE-17361.08.patch, HIVE-17361.09.patch, HIVE-17361.1.patch,
HIVE-17361.10.patch, HIVE-17361.11.patch, HIVE-17361.12.patch, HIVE-17361.14.patch, HIVE-17361.16.patch,
HIVE-17361.17.patch, HIVE-17361.19.patch, HIVE-17361.2.patch, HIVE-17361.20.patch, HIVE-17361.21.patch,
HIVE-17361.23.patch, HIVE-17361.24.patch, HIVE-17361.25.patch, HIVE-17361.3.patch, HIVE-17361.4.patch
> LOAD DATA has not been supported for ACID tables since ACID was introduced. We need to close
this gap between ACID tables and regular Hive tables.
> Current Documentation is under [DML Operations|]
and [Loading files into tables|]:
> \\
> * Load Data performs very limited validation of the data. In particular, it uses the
input file name as-is, which may not be of the form 00000_0; this can break some read logic
(and certainly will for Acid).
> * It does not check the schema of the file.  This may be a non-issue for Acid, which requires
ORC; ORC is self-describing, so Schema Evolution may handle this seamlessly (assuming the schema
is not too different).
> * It does check that the _InputFormats_ are compatible.
> * Bucketed (and thus sorted) tables don't support Load Data (but only if hive.strict.checks.bucketing=true,
which is the default).  This restriction will be kept for Acid.
> * Load Data supports the OVERWRITE clause
> * What happens to file permissions/ownership: rename vs copy differences
> \\
> The implementation will follow the same idea as in HIVE-14988 and use a base_N/ dir for
> \\
> How is minor compaction going to handle delta/base with original files?
> Since delta_8_8/_meta_data is created before files are moved, delta_8_8 becomes visible
before it's populated.  Is that an issue?
> It's not since txn 8 is not committed.
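> A minimal sketch of why the early-visible directory is harmless, assuming a reader that
ignores any delta_x_y directory whose transaction IDs are not all committed.  The function name,
the directory-name pattern, and the set of committed transaction IDs are hypothetical
illustrations, not Hive's actual reader API.

```python
import re

# Hypothetical pattern for delta directories named delta_<min>_<max>.
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def visible_deltas(dir_names, committed_txns):
    """Return the delta dirs whose whole txn range is committed (illustrative only)."""
    visible = []
    for name in dir_names:
        match = DELTA_RE.match(name)
        if match is None:
            continue  # not a delta directory, e.g. base_N
        low, high = int(match.group(1)), int(match.group(2))
        if all(txn in committed_txns for txn in range(low, high + 1)):
            visible.append(name)
    return visible
```

Under this assumption, delta_8_8 is skipped by every reader until txn 8 commits, regardless of
whether its files have been moved into place yet.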
> h3. Implementation Notes/Limitations (patch 25)
> * bucketed/sorted tables are not supported
> * input file names must be of the form 00000_0/00000_0_copy_1 - enforced. (HIVE-18125)
> * Load Data creates a delta_x_x/ that contains new files
> * Load Data w/Overwrite creates a base_x/ that contains new files
> * A '_metadata_acid' file is placed in the target directory to indicate it requires special
handling on read
> * The input files must be 'plain' ORC files, i.e. not contain acid metadata columns as
would be the case if these files were copied from another Acid table.  In the latter case,
the ROW_IDs embedded in the data may not make sense in the target table (if it's in a different
cluster, for example).  Such files may also have a mix of committed and aborted data.
> ** this could be relaxed later by adding info to the _metadata_acid file to ignore existing
ROW_IDs on read.
> * ROW_IDs are attached dynamically at read time and made permanent by compaction.  This
is done the same way as the handling of files that were written to a table before it was converted
to Acid.
> * Vectorization is supported
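> Two of the notes above (the enforced file-name form and the choice between delta_x_x/ and
base_x/) can be sketched as follows.  This is an illustration under stated assumptions, not the
code in the patch: the helper names are hypothetical, the regular expression is a guess at what
"of the form 00000_0/00000_0_copy_1" permits, and any zero-padding of the ID in directory names
is not modeled.

```python
import re

# Assumed shape of an accepted input file name: digits, underscore, digits,
# with an optional _copy_<n> suffix (e.g. 00000_0 or 00000_0_copy_1).
FILE_NAME_RE = re.compile(r"\d+_\d+(_copy_\d+)?")


def is_valid_input_name(file_name):
    """True if the name matches the assumed 00000_0 / 00000_0_copy_1 form."""
    return FILE_NAME_RE.fullmatch(file_name) is not None


def target_dir(txn_id, overwrite):
    """Pick the directory for the loaded files: base_x for OVERWRITE, else delta_x_x."""
    if overwrite:
        return f"base_{txn_id}"
    return f"delta_{txn_id}_{txn_id}"
```

For example, target_dir(8, False) yields delta_8_8 and target_dir(8, True) yields base_8, while a
file named data.orc would be rejected by is_valid_input_name.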

This message was sent by Atlassian JIRA
