hive-issues mailing list archives

From "Swarnim Kulkarni (JIRA)" <>
Subject [jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
Date Mon, 01 Feb 2016 03:14:39 GMT


Swarnim Kulkarni commented on HIVE-6147:

> It is pretty common to use schema-less avro objects in HBase.

I am not sure that is true, or even possible. As far as I understand, you almost always have
to provide the exact schema that was used to write the data when you deserialize it, and the
best way to guarantee that is to store the schema alongside the data itself. Schema evolution
also becomes a mess. Imagine writing a billion rows to HBase with one schema, then evolving
the schema and writing another billion rows with the new one. How do you ensure the first
billion rows are still correctly readable?

> (if there are billions of rows with objects of the same type, it is not reasonable to store
> the same schema in all of them) and it is not convenient to write a custom schema retriever
> for each such case.

Correct. I agree it is inefficient to store it in every single cell, although IMO that isn't
a good excuse for not writing the schema at all. A better design in this case is to use some
kind of schema registry: use a custom serializer that writes the schema to the registry,
generates an ID of some kind, and persists the ID along with the data. When reading the data,
use the ID to pull the schema from the registry and decode the data. That is also where a
custom implementation of AvroSchemaRetriever makes sense: it would know how to read your
schema from the schema registry and hand it to Hive, and Hive would handle the
deserialization from there on.
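
The registry approach described above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not Hive or HBase code: InMemorySchemaRegistry,
encode_cell and decode_cell are hypothetical names, the registry is a plain in-process
dictionary standing in for an external service, the 4-byte big-endian ID prefix is an
arbitrary choice, and the Avro payload is treated as opaque bytes.

```python
import struct


class InMemorySchemaRegistry:
    """Hypothetical stand-in for an external schema registry service."""

    def __init__(self):
        self._id_by_schema = {}
        self._schema_by_id = {}

    def register(self, schema_json: str) -> int:
        # Reuse the existing ID if this exact schema text was registered before.
        if schema_json in self._id_by_schema:
            return self._id_by_schema[schema_json]
        schema_id = len(self._schema_by_id) + 1
        self._id_by_schema[schema_json] = schema_id
        self._schema_by_id[schema_id] = schema_json
        return schema_id

    def lookup(self, schema_id: int) -> str:
        return self._schema_by_id[schema_id]


def encode_cell(registry, schema_json: str, avro_payload: bytes) -> bytes:
    """Writer side: store only a small schema ID in front of the serialized bytes."""
    schema_id = registry.register(schema_json)
    return struct.pack(">I", schema_id) + avro_payload


def decode_cell(registry, cell: bytes):
    """Reader side: recover the writer schema from the ID, then return the payload."""
    (schema_id,) = struct.unpack(">I", cell[:4])
    return registry.lookup(schema_id), cell[4:]
```

With this layout, rows written under an older schema keep pointing at that schema's ID, so
they stay readable after the schema evolves. A custom schema retriever would play the role
of decode_cell's lookup step and leave the actual Avro decoding to Hive.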

> Support avro data stored in HBase columns
> -----------------------------------------
>                 Key: HIVE-6147
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Swarnim Kulkarni
>            Assignee: Swarnim Kulkarni
>              Labels: TODOC14
>             Fix For: 0.14.0
>         Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt, HIVE-6147.3.patch.txt,
> HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt, HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
> Presently, the HBase Hive integration supports querying only primitive data types in
> columns. It would be nice to be able to store and query Avro objects in HBase columns by making
> them visible as structs to Hive. This will allow Hive to perform ad hoc analysis of HBase
> data which can be deeply structured.
