hive-issues mailing list archives

From "Swarnim Kulkarni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
Date Mon, 01 Feb 2016 18:59:39 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126803#comment-15126803
] 

Swarnim Kulkarni commented on HIVE-6147:
----------------------------------------

{quote}
Avro supports schema evolution that allows data to be written with one schema and read with
another
{quote}

Yup. Definitely agree. However, the point I was trying to make is that you would still need
to provide the exact schema that was used when writing the data. Let's take an example.
Say you used schema S1 to write a billion rows to HBase. The schema then evolved to
S2 (hopefully in a compatible way) and you wrote another billion rows with it. The schema
evolved again to S3 and you wrote another billion rows. To be able to read all this data,
this is what you would need to do:

1st billion rows:

Writer Schema: S1
Reader Schema: S3

2nd billion rows:

Writer Schema: S2
Reader Schema: S3

3rd billion rows:

Writer Schema: S3
Reader Schema: S3

So as you see, to read the data back successfully you are still providing the *exact same
version* of the schema that was used to write it. Without it, it would be extremely
hard for Avro to make head or tail of our data. You "might" still get lucky and be able
to deserialize the 1st billion rows using S3 as both reader and writer schema, but there are
absolutely no guarantees whatsoever. That is why, regardless, you would still need a way to
track which schema was used to persist the data when you read it back, and the current design
of Hive/HBase Avro support closely follows that pattern.
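
To make the point concrete, here is a toy sketch in Python. It is not Avro's real wire format
or API: the encode/decode helpers and the field lists for S1, S2 and S3 are made up for
illustration. What it shares with Avro's binary encoding is that the bytes carry no field
names or types, so the reader has to be told which schema wrote them before it can resolve
the record against the reader schema.

```python
import struct

# Hypothetical schemas: ordered lists of (field name, type, default).
S1 = [("id", "long", None)]
S2 = [("id", "long", None), ("name", "string", "")]
S3 = [("id", "long", None), ("name", "string", ""), ("score", "long", 0)]


def encode(record, writer_schema):
    """Write field values in schema order, with no names or type tags."""
    out = bytearray()
    for name, ftype, _default in writer_schema:
        value = record[name]
        if ftype == "long":
            out += struct.pack(">q", value)
        else:  # string: 4-byte length prefix followed by UTF-8 bytes
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
    return bytes(out)


def decode(payload, writer_schema, reader_schema):
    """Parse bytes with the writer schema, then project onto the reader schema."""
    pos = 0
    written = {}
    for name, ftype, _default in writer_schema:
        if ftype == "long":
            (written[name],) = struct.unpack_from(">q", payload, pos)
            pos += 8
        else:
            (length,) = struct.unpack_from(">I", payload, pos)
            pos += 4
            written[name] = payload[pos:pos + length].decode("utf-8")
            pos += length
    if pos != len(payload):
        raise ValueError("payload does not match the given writer schema")
    # Fields the writer did not know about take the reader's default.
    return {name: written.get(name, default) for name, _t, default in reader_schema}


row_v1 = encode({"id": 7}, S1)               # one of the "1st billion" rows
row_v2 = encode({"id": 8, "name": "b"}, S2)  # one of the "2nd billion" rows

print(decode(row_v1, S1, S3))  # writer S1, reader S3
print(decode(row_v2, S2, S3))  # writer S2, reader S3

try:
    decode(row_v1, S3, S3)     # wrong writer schema for this row
except struct.error as exc:
    print("cannot decode:", exc)
```

In this sketch the wrong writer schema fails loudly because the bytes run out. With other
field layouts it could just as well produce a wrong record silently, which is the "no
guarantees" case described above.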

> Support avro data stored in HBase columns
> -----------------------------------------
>
>                 Key: HIVE-6147
>                 URL: https://issues.apache.org/jira/browse/HIVE-6147
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Swarnim Kulkarni
>            Assignee: Swarnim Kulkarni
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt, HIVE-6147.3.patch.txt,
>                      HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt, HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
>
> Presently, the HBase Hive integration supports querying only primitive data types in
> columns. It would be nice to be able to store and query Avro objects in HBase columns by making
> them visible as structs to Hive. This will allow Hive to perform ad hoc analysis of HBase
> data which can be deeply structured.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
