hive-issues mailing list archives

From "Miklos Gergely (JIRA)" <>
Subject [jira] [Updated] (HIVE-20536) Add Surrogate Keys function to Hive
Date Fri, 14 Sep 2018 08:32:00 GMT


Miklos Gergely updated HIVE-20536:
    Attachment: HIVE-20536.03.patch

> Add Surrogate Keys function to Hive
> -----------------------------------
>                 Key: HIVE-20536
>                 URL:
>             Project: Hive
>          Issue Type: Task
>          Components: Hive
>            Reporter: Miklos Gergely
>            Assignee: Miklos Gergely
>            Priority: Major
>         Attachments: HIVE-20536.01.patch, HIVE-20536.02.patch, HIVE-20536.03.patch
> Surrogate keys are unique integers generated for and assigned to each row in a table.
If we can generate such integers, then in conjunction with the DEFAULT clause we get surrogate
key functionality. Consider the following DDL:
> create table t1 (a string, b bigint default unique_long());
> We already have a DEFAULT clause in which you can specify a function to provide values.
So what we need is a UDF that can generate unique longs for each row across queries for a
> The idea is to use write_id. This is a column in the metastore table TXN_COMPONENTS whose
value is determined at compile time and used during query execution. Each query execution
generates a new write_id, so we can seed the UDF with this value during compilation.
> Then we statically allocate a range to each task, from which it can draw the next long.
So, let's say we divvy up the 64-bit write_id such that the first 24 bits keep their original
use, that is txns. The next 16 bits are used for task_attempts and the last 24 bits to generate
a new long for each row. This implies we can allow about 17M (2^24) txns, 65K (2^16) task
attempts and about 17M (2^24) rows in a task. If a query hits any of those limits, we can fail it.
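> The 24/16/24 split can be sketched in Java as below. This is only an illustration of the
arithmetic, not code from the patch: the class SurrogateKeyLayout and its methods are
hypothetical names, and the sketch assumes write_id sits in the high bits.

```java
/** Hypothetical helper illustrating the 24/16/24 bit split; not part of Hive. */
final class SurrogateKeyLayout {
    static final int WRITE_ID_BITS = 24; // high bits: txn usage (write_id)
    static final int TASK_BITS = 16;     // middle bits: task attempt
    static final int ROW_BITS = 24;      // low bits: per-task row counter

    static final long MAX_WRITE_ID = (1L << WRITE_ID_BITS) - 1; // 16,777,215
    static final long MAX_TASK_ID = (1L << TASK_BITS) - 1;      // 65,535
    static final long MAX_ROW_ID = (1L << ROW_BITS) - 1;        // 16,777,215

    private SurrogateKeyLayout() {}

    /** Packs the three fields into one long; fails if any field is out of range. */
    static long pack(long writeId, long taskId, long rowId) {
        check("write_id", writeId, MAX_WRITE_ID);
        check("task_attempt", taskId, MAX_TASK_ID);
        check("row", rowId, MAX_ROW_ID);
        // Note: a write_id of 2^23 or more sets the sign bit, so the key is negative.
        return (writeId << (TASK_BITS + ROW_BITS)) | (taskId << ROW_BITS) | rowId;
    }

    static long writeId(long key) {
        return key >>> (TASK_BITS + ROW_BITS); // unsigned shift ignores the sign bit
    }

    static long taskId(long key) {
        return (key >>> ROW_BITS) & MAX_TASK_ID;
    }

    static long rowId(long key) {
        return key & MAX_ROW_ID;
    }

    private static void check(String name, long value, long max) {
        if (value < 0 || value > max) {
            throw new IllegalStateException(name + " out of range: " + value);
        }
    }
}
```

> Under these assumptions, SurrogateKeyLayout.pack(writeId, taskId, rowId) throws as soon as
any field exceeds its limit, which corresponds to failing the query.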
> Implementation-wise: serialize write_id in initialize() of the UDF. Then during execute()
we find out which task_attempt the current task is and use it along with write_id to get the
starting long, and return a new value on each invocation of execute().
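> A minimal sketch of that per-task behaviour, in plain Java rather than against the Hive UDF
API. SurrogateKeySequence is a hypothetical class: its constructor stands in for the point
where write_id and task_attempt become known, and next() stands in for each execute() call.

```java
/** Hypothetical per-task sequence; a stand-in for the UDF state, not Hive code. */
final class SurrogateKeySequence {
    private static final int TASK_BITS = 16;
    private static final int ROW_BITS = 24;
    private static final long MAX_WRITE_ID = (1L << 24) - 1;
    private static final long MAX_TASK_ID = (1L << TASK_BITS) - 1;
    private static final long MAX_ROW_ID = (1L << ROW_BITS) - 1;

    private final long base;  // write_id and task_attempt already shifted into place
    private long nextRow = 0; // low 24 bits, incremented once per generated key

    SurrogateKeySequence(long writeId, long taskAttemptId) {
        if (writeId < 0 || writeId > MAX_WRITE_ID) {
            throw new IllegalStateException("write_id out of range: " + writeId);
        }
        if (taskAttemptId < 0 || taskAttemptId > MAX_TASK_ID) {
            throw new IllegalStateException("task_attempt out of range: " + taskAttemptId);
        }
        // Starting long for this task: the row counter bits are all zero.
        this.base = (writeId << (TASK_BITS + ROW_BITS)) | (taskAttemptId << ROW_BITS);
    }

    /** Returns the next key for this task, failing once the row range is used up. */
    long next() {
        if (nextRow > MAX_ROW_ID) {
            throw new IllegalStateException("row limit exceeded for this task");
        }
        return base | nextRow++;
    }
}
```

> Because each task draws only from its own range, two tasks with different task_attempt values
never produce the same key in this sketch, and no coordination between tasks is needed.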
> Here we are assuming write_id can be determined at compile time, which should be the
case, but we still need to figure out how to get a handle to it.

This message was sent by Atlassian JIRA
