hive-issues mailing list archives

From "Jesus Camacho Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20720) Add partition column option to JDBC handler
Date Thu, 11 Oct 2018 23:33:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647179#comment-16647179
] 

Jesus Camacho Rodriguez commented on HIVE-20720:
------------------------------------------------

[~daijy], I believe the current approach may cause problems. Assume a table 'tab' with columns
'a', 'b', and 'c', where column 'c' is the partition column. Then the user (or Calcite) defines
a query:
{code:sql}
SELECT a, b FROM tab;
{code}
Unless I am mistaken, this will fail when we add the partition column predicate with the current
approach, because the inner query does not project 'c' and we are generating:
{code:sql}
SELECT * FROM (SELECT a, b FROM tab) temp WHERE temp.c < z and temp.c > y;
{code}
My proposal was to try to wrap the table instead, as this is more general and works with all
Project/Filter queries:
{code:sql}
SELECT a, b FROM (SELECT * FROM tab WHERE c < z AND c > y) tab;
{code}
Though to do that, we may need to generate an AST from the SQL. Another option would be to
let the user specify the table name; then we just need to find the {{...from tabName}} pattern.
What do you think?
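
The difference between the two rewrites can be checked with a small sketch. It uses SQLite as a stand-in for the remote JDBC source, which is an assumption made only for illustration; the sample rows and the bounds (y = 0, z = 10) are made up, and other databases may report the failure differently.

```python
import sqlite3

# In-memory stand-in for the remote table 'tab'; 'c' plays the partition column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?)",
    [(1, 10, 5), (2, 20, 15), (3, 30, 7)],
)

user_query = "SELECT a, b FROM tab"
lower, upper = 0, 10  # hypothetical split bounds (y and z in the comment)

# Wrapping the whole query: 'c' is not projected by the inner query,
# so the outer predicate cannot be resolved.
outer_wrap = f"SELECT * FROM ({user_query}) temp WHERE temp.c < ? AND temp.c > ?"
try:
    conn.execute(outer_wrap, (upper, lower)).fetchall()
    outer_error = None
except sqlite3.OperationalError as exc:
    outer_error = str(exc)

# Wrapping the table: the predicate is applied before the projection.
inner_wrap = "SELECT a, b FROM (SELECT * FROM tab WHERE c < ? AND c > ?) tab"
inner_rows = conn.execute(inner_wrap, (upper, lower)).fetchall()

print("outer wrap error:", outer_error)
print("inner wrap rows:", inner_rows)
```

In this sketch the outer wrap is rejected because the column is unknown, while the inner wrap returns only the rows whose 'c' falls inside the bounds.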

> Add partition column option to JDBC handler
> -------------------------------------------
>
>                 Key: HIVE-20720
>                 URL: https://issues.apache.org/jira/browse/HIVE-20720
>             Project: Hive
>          Issue Type: New Feature
>          Components: StorageHandler
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>         Attachments: HIVE-20720.1.patch, HIVE-20720.2.patch, HIVE-20720.3.patch, HIVE-20720.4.patch
>
>
> Currently JdbcStorageHandler does not split input in Tez. The reason is that numSplit of
> JdbcInputFormat.getSplits can only be passed via "mapreduce.job.maps" in Tez, and
> "mapreduce.job.maps" is not a valid param if an authorizer (e.g. SQLStdAuth) is in use.
> The user always ends up using 1 split.
> We need to rely on this new feature if we want to support multi-splits. Here is my proposal:
> 1. Specify partitionColumn/numPartitions, and optionally lowerBound/upperBound, in tblproperties
> if the user wants to split the jdbc data source. If lowerBound/upperBound is not specified,
> JdbcStorageHandler will run a max/min query in the planner to obtain them. For simplicity, we
> can currently limit partitionColumn to numeric/date/timestamp columns only
> 2. If partitionColumn/numPartitions are not specified, don't split input
> 3. Splits are equal intervals without respect to data distribution
> 4. There is also a "hive.sql.query.split" flag that vetoes the split (can be set manually or
> automatically by calcite)
> 5. If partitionColumn is not defined, but numPartitions is defined, use the original limit/offset
> logic (however, don't rely on numSplit).
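
As a rough illustration of item 3 in the proposal, equal-width splits over a numeric partition column could be computed as below. This is a minimal sketch, not the code in the attached patches: the function names are hypothetical, and the boundary handling (half-open ranges, with the last range closed so that upperBound is included) is an assumption.

```python
def equal_interval_splits(lower, upper, num_partitions):
    """Cut [lower, upper] into num_partitions ranges of equal width.

    The ranges ignore how the data is actually distributed.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    width = (upper - lower) / num_partitions
    # Reuse 'upper' itself as the final boundary to avoid float drift.
    bounds = [lower + i * width for i in range(num_partitions)] + [upper]
    return list(zip(bounds[:-1], bounds[1:]))


def split_predicates(column, lower, upper, num_partitions):
    """Render one WHERE predicate per split (hypothetical helper)."""
    ranges = equal_interval_splits(lower, upper, num_partitions)
    predicates = []
    for i, (lo, hi) in enumerate(ranges):
        # Assumption: only the last range is closed on the right.
        op = "<=" if i == len(ranges) - 1 else "<"
        predicates.append(f"{column} >= {lo} AND {column} {op} {hi}")
    return predicates


for predicate in split_predicates("c", 0, 100, 4):
    print(predicate)
```

With skewed data, some of these ranges may hold far more rows than others, which is the trade-off item 3 accepts.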



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
