hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-18098) Add support for Export/Import for Acid tables
Date Fri, 01 Dec 2017 18:00:10 GMT

     [ https://issues.apache.org/jira/browse/HIVE-18098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Eugene Koifman updated HIVE-18098:
----------------------------------
    Description: 
How should this work?
For regular tables, export just copies the files under the table root to a specified directory.
This doesn't make sense for Acid tables:
* Some data may belong to aborted transactions
* Transaction IDs are embedded in data file names.  You'd have to export delta/ and base/ directories, each of which may have files with the same names, e.g. bucket_00000 (see the layout sketch after this list).
* On import these IDs won't make sense in a different cluster, or even in a different table, which may have its own delta_x_x for the same x but with different data.
* Export creates a _metadata file with column types, storage format, etc.  Perhaps it can include info about aborted IDs (if the whole file can't be skipped).
* Even importing into the same table on the same cluster may be a problem.  For example, delta_5_5/ existed at the time of export and was included in it, but 2 days later it may no longer exist because it was compacted and cleaned.
* If importing back into the same table on the same cluster, the data could be imported into a different transaction (assuming per-table writeIDs) w/o having to remap the IDs in the rows themselves.
* Support Import Overwrite?
* Support Import as a new txn with remapping of ROW_IDs?  The new writeID can be stored in
a delta_x_x/_meta_data and ROW__IDs can be remapped at read time (like isOriginal) and made
permanent by compaction.
* It doesn't seem reasonable to import Acid data into a non-Acid table.
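
For illustration, here is a hypothetical layout of an Acid table at export time (all transaction IDs and paths below are made up) showing the file-name collisions described above:

{noformat}
warehouse/t/
  base_5/bucket_00000               <- compacted data up to txn 5
  delta_7_7/bucket_00000            <- same file name as in delta_8_8/
  delta_8_8/bucket_00000
  delete_delta_9_9/bucket_00000     <- delete events referencing rows above
{noformat}

Flattening these into one export directory would lose the directory names that both disambiguate the bucket_00000 files and carry the transaction IDs.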

Perhaps import can work similarly to Load Data: look at the imported file and, if it has Acid columns, leave a note in delta_x_x/_meta_data indicating that these columns should be skipped and new ROW_IDs assigned at read time.
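
For reference, the statements involved would be the existing Export/Import syntax (table names here are made up; Acid support for these statements is what this issue proposes):

{code:sql}
-- export table data plus the _metadata file to a directory
EXPORT TABLE acid_tbl TO '/tmp/acid_tbl_export';

-- import into a new table; for an Acid target this would need to
-- allocate a new writeID instead of trusting the exported IDs
IMPORT TABLE acid_tbl_copy FROM '/tmp/acid_tbl_export';
{code}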

h3. Case I
Table has delta_7_7 and delta_8_8.  Since both may have bucket_00000, we could export to export_dir and rename the files as bucket_00000 and bucket_00000_copy_1.  Load Data supports an input dir with copy_N files.
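
Under that scheme the export directory might look like this (illustrative only):

{noformat}
export_dir/
  _metadata
  data/
    bucket_00000            <- from delta_7_7/bucket_00000
    bucket_00000_copy_1     <- from delta_8_8/bucket_00000
{noformat}
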
h3. Case II
What if we have delete_delta_9_9 in the source?  Now you can't just ignore ROW_IDs after import.
* -Only export the latest base_N?  Or, more generally, everything up to the smallest deleted ROW_ID (which may be hard to find w/o scanning all deletes; the export would then have to be done under an X lock to prevent new concurrent deletes).-
* Stash all deletes in some additional file which on import gets added to the target delta/ so that the Acid reader can apply them to the data in this delta/ without clashing with 'normal' deletes that already exist in the table (sketched after this list).
** Here we may also have multiple delete_delta/ dirs with identical file names.  Does the delete delta reader handle copy_N files?
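
One possible shape of the target delta under the "stash all deletes" option, assuming a new writeID is allocated on import (every name below is hypothetical):

{noformat}
target_tbl/
  delta_12_12/
    bucket_00000            <- from source delta_7_7
    bucket_00000_copy_1     <- from source delta_8_8
    _meta_data              <- note telling the reader to remap ROW_IDs
    <stashed-deletes file>  <- delete events applied only to this delta
{noformat}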


> Add support for Export/Import for Acid tables
> ---------------------------------------------
>
>                 Key: HIVE-18098
>                 URL: https://issues.apache.org/jira/browse/HIVE-18098
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
