hive-issues mailing list archives

From "Chris Drome (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-14870) OracleStore: RawStore implementation optimized for Oracle
Date Tue, 13 Dec 2016 21:17:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746299#comment-15746299 ]

Chris Drome edited comment on HIVE-14870 at 12/13/16 9:17 PM:
--------------------------------------------------------------

[~alangates], let me answer from the bottom up.

Pages 2-3 explain what I did regarding deduplicating data. In short, I have removed LOCATION
and CD_ID from the SDS table, because those columns result in a unique entry per table/partition.
I also collapsed the SDS and SERDES tables into a single table. These two changes result in a
decrease from 3.4M records to 15 records.
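
As a rough illustration of why dropping the per-partition columns allows this kind of
collapse, here is a minimal Python sketch. It is not the OracleStore schema or code: the
row layout, the field names other than LOCATION and CD_ID, and the data are all hypothetical
and synthetic, chosen only to show that rows differing solely in those two columns reduce
to a handful of distinct entries.

```python
from collections import namedtuple

# Hypothetical, simplified storage-descriptor row. Only "location" and
# "cd_id" correspond to columns named in the comment; the remaining
# fields are illustrative assumptions, not the real SDS/SERDES schema.
StorageDescriptor = namedtuple(
    "StorageDescriptor",
    ["location", "cd_id", "input_format", "output_format", "serde_lib"],
)


def dedup_key(sd):
    """Return the fields that remain once location and cd_id are dropped."""
    return (sd.input_format, sd.output_format, sd.serde_lib)


def deduplicate(rows):
    """Collapse rows that differ only in location/cd_id into one entry each."""
    return sorted({dedup_key(sd) for sd in rows})


# Synthetic data: 1000 partitions, each with its own location and cd_id,
# but sharing one of only two format/serde combinations.
rows = [
    StorageDescriptor(
        location=f"/warehouse/t/part={i}",
        cd_id=i,
        input_format="orc_in" if i % 2 == 0 else "text_in",
        output_format="orc_out" if i % 2 == 0 else "text_out",
        serde_lib="orc_serde" if i % 2 == 0 else "lazy_serde",
    )
    for i in range(1000)
]

unique = deduplicate(rows)
print(len(rows), "rows before;", len(unique), "distinct entries after")
```

With location and cd_id in the row, every entry is unique; without them, the number of
distinct entries depends only on how many format/serde combinations are actually in use.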

I didn't check the impact of each individual change, but all of the changes in aggregate result
in a 3-4x speedup for getTable calls.

I haven't tested array types to replace columns, etc., because some of our tables consist of
hundreds of columns and I felt that the tradeoff would not be worth it (I was concerned about
needlessly bloating tables with array types). I plan to implement the same caching mechanism that you
employ in HBaseStore, so any savings we would get from array types would be minimal. Furthermore, getTable
calls take a fraction of the time that getPartitions calls take, so the majority of the effort went
into optimizing the getPartitions calls.

I'm currently working with our QE to hammer out the last couple of failures that we are hitting
in regression/integration tests. I'd like to refactor and clean up some code around the getPartitions
calls as well. I hope to have a cleaner version that I can post before the end of the year.


was (Author: cdrome):
[~alangates], let me answer from the bottom up.

Pages 2-3 explain what I did regarding deduplicating data. In short, I have removed LOCATION
and CD_ID from the SDS table, because those columns result in a unique entry per table/partition.
I also collapsed the SDS and SERDES tables into a single table. These two changes result in a
decrease from 3.4M records to 15 records.

I didn't check the impact of each individual change, but all of the changes in aggregate result
in a 3-4x speedup for getTable calls.

I haven't tested array types to replace columns, etc., because some of our tables consist of
hundreds of columns and I felt that the tradeoff would not be worth it. I plan to implement the
same caching mechanism that you employ in HBaseStore, so any savings we would get from array types
would be minimal. Furthermore, getTable calls take a fraction of the time that getPartitions calls
take, so the majority of the effort went into optimizing the getPartitions calls.

I'm currently working with our QE to hammer out the last couple of failures that we are hitting
in regression/integration tests. I'd like to refactor and clean up some code around the getPartitions
calls as well. I hope to have a cleaner version that I can post before the end of the year.

> OracleStore: RawStore implementation optimized for Oracle
> ---------------------------------------------------------
>
>                 Key: HIVE-14870
>                 URL: https://issues.apache.org/jira/browse/HIVE-14870
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Chris Drome
>            Assignee: Chris Drome
>         Attachments: OracleStoreDesignProposal.pdf
>
>
> The attached document is a proposal for a RawStore implementation which is optimized
> for Oracle and replaces DataNucleus. The document outlines schema changes, OracleStore implementation
> details, and performance tests against ObjectStore, ObjectStore+DirectSQL, and OracleStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
