hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-19205) Hive streaming ingest improvements (v2)
Date Fri, 13 Apr 2018 19:04:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-19205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-19205:
-----------------------------------------
    Description: 
This is an umbrella jira to track Hive streaming ingest improvements. At a high level, the improvements are:
- Support for dynamic partitioning
- API changes (simple streaming connection builder)
- Hide transaction batches from clients (the client can tune the transaction batch size but doesn't have to know about transaction batches)
- Support auto rollover to the next transaction batch (clients don't have to worry about closing a transaction batch and opening a new one)
- Record writers will all be strict, meaning the schema of the record has to match the table schema. This avoids the multiple serialization/deserialization needed to re-order columns when there is a schema mismatch
- Automatic distribution for non-bucketed tables so that the compactor can have more parallelism
- Create delta files with all ORC overhead disabled (no index, no compression, no dictionary). The compactor will recreate the ORC files with compression and dictionary encoding.
- Automatic memory management via auto-flushing (will yield smaller stripes for delta files but is more scalable, and clients don't have to worry about distributing the data across writers)
- Support for more writers (Avro specifically; ORC passthrough format?)
- Support for accepting an input stream instead of a record byte[]
- Removing the HCatalog dependency (the old streaming API will stay in the hcatalog package for backward compatibility; the new streaming API will be in its own Hive module)
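To make the builder and auto-rollover items concrete, here is a minimal, self-contained Java sketch of the proposed shape of the API. All class, method, and parameter names below (StreamingConnection, newBuilder, withTransactionBatchSize, connect, write) are hypothetical stand-ins, not the actual Hive streaming API; the point is only to illustrate a connection built via a fluent builder where the client can tune the transaction batch size but never opens or closes batches itself.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingSketch {

    /** Illustrative stand-in for a streaming connection; names are hypothetical. */
    static class StreamingConnection {
        private final int txnBatchSize;
        private int writesInCurrentBatch = 0;
        private int batchesOpened = 1;
        private final List<String> rows = new ArrayList<>();

        private StreamingConnection(int txnBatchSize) {
            this.txnBatchSize = txnBatchSize;
        }

        static Builder newBuilder() { return new Builder(); }

        static class Builder {
            // Tunable by the client, but the client never handles batches directly.
            private int txnBatchSize = 10;
            Builder withTransactionBatchSize(int n) { this.txnBatchSize = n; return this; }
            StreamingConnection connect() { return new StreamingConnection(txnBatchSize); }
        }

        /** Writes a record; rolls over to a new transaction batch transparently. */
        void write(String record) {
            if (writesInCurrentBatch == txnBatchSize) {
                // Auto-rollover: the current batch would be committed and closed
                // here, and a new one opened, without client involvement.
                writesInCurrentBatch = 0;
                batchesOpened++;
            }
            rows.add(record);
            writesInCurrentBatch++;
        }

        int batchesOpened() { return batchesOpened; }
        int rowCount() { return rows.size(); }
    }

    public static void main(String[] args) {
        StreamingConnection conn = StreamingConnection.newBuilder()
                .withTransactionBatchSize(3)
                .connect();
        for (int i = 0; i < 7; i++) {
            conn.write("row-" + i);
        }
        // 7 writes with a batch size of 3: rollover happens twice, invisibly.
        System.out.println(conn.rowCount() + " rows, " + conn.batchesOpened() + " batches");
    }
}
```

The design choice this illustrates: because batch lifecycle lives entirely inside write(), the client's loop is the same whether a batch boundary was crossed or not, which is what "hide the transaction batches from clients" buys.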

  was:
This is umbrella jira to track hive streaming ingest improvements. At a high level following
are the improvements
- Support for dynamic partitioning
- API changes (simple streaming connection builder)
- Hide the transaction batches from clients (client can tune the transaction batch but doesn't
have to know about the transaction batch size)
- Support auto rollover to next transaction batch (clients don't have to worry about closing
a transaction batch and opening a new one)
- Record writers will all be strict meaning the schema of the record has to match table schema.
This is to avoid the multiple serialization/deserialization for re-ordering columns if there
is schema mismatch
- Automatic distribution for non-bucketed tables so that compactor can have more parallelism
- Create delta files with all ORC overhead disabled (no compression, no dictionary). Compactor
will recreate the orc files with compression and dictionary encoding.
- Automatic memory management via auto-flushing (will yield smaller stripes for delta files
but is more scalable and clients don't have to worry about distributing the data across writers)
- Support for more writers (Avro specifically. ORC passthrough format?)
- Support to accept input stream instead of record byte[]
- Removing HCatalog dependency (old streaming API will be in the hcatalog package for backward
compatibility, new streaming API will be in its own hive module)


> Hive streaming ingest improvements (v2)
> ---------------------------------------
>
>                 Key: HIVE-19205
>                 URL: https://issues.apache.org/jira/browse/HIVE-19205
>             Project: Hive
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
