hive-issues mailing list archives

From "Jesus Camacho Rodriguez (JIRA)" <>
Subject [jira] [Commented] (HIVE-20720) Add partition column option to JDBC handler
Date Thu, 18 Oct 2018 04:13:00 GMT


Jesus Camacho Rodriguez commented on HIVE-20720:

[~daijy], thanks. Regarding pattern matching on the FROM clause, I believe it is quite safe with your
latest change: Calcite will only set the splittable flag to 'true' for Select-Filter-Scan
queries (no join, group by, or other statements), and if a user runs into issues with a hardcoded
query, they can always rewrite it. As we move forward and split more complex computation,
we may revisit that logic.

+1 (pending tests)

> Add partition column option to JDBC handler
> -------------------------------------------
>                 Key: HIVE-20720
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: StorageHandler
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>         Attachments: HIVE-20720.1.patch, HIVE-20720.2.patch, HIVE-20720.3.patch, HIVE-20720.4.patch, HIVE-20720.5.patch, HIVE-20720.6.patch, HIVE-20720.7.patch, HIVE-20720.8.patch
>
> Currently JdbcStorageHandler does not split input in Tez. The reason is that the numSplit argument of JdbcInputFormat.getSplits can only be passed via "mapreduce.job.maps" in Tez, and "mapreduce.job.maps" is not a valid parameter if an authorizer (e.g. SQLStdAuth) is in use. The user always ends up with 1 split.
> We need to rely on this new feature if we want to support multi-splits. Here is my proposal:
> 1. Specify partitionColumn/numPartitions, and optionally lowerBound/upperBound, in tblproperties if the user wants to split the JDBC data source. If lowerBound/upperBound is not specified, JdbcStorageHandler will run a max/min query in the planner to obtain them. For simplicity, we can currently limit partitionColumn to numeric/date/timestamp columns.
> 2. If partitionColumn/numPartitions are not specified, don't split the input.
> 3. Splits are equal intervals, without regard to data distribution.
> 4. There is also a "hive.sql.query.split" flag that vetoes the split (it can be set manually, or automatically by Calcite).
> 5. If partitionColumn is not defined but numPartitions is defined, use the original limit/offset logic (however, don't rely on numSplit).
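
The proposal above can be illustrated with a small sketch. This is not the patch's code: the function names, the dictionary standing in for tblproperties, the treatment of "false" as the veto value for "hive.sql.query.split", and the handling of the last interval are all assumptions made for illustration. Only the property names themselves come from the description.

```python
def choose_strategy(props):
    """Pick a split strategy from table properties (hypothetical helper).

    `props` is a plain dict standing in for tblproperties.
    """
    # Item 4: assume the flag vetoes splitting when set to "false".
    if str(props.get("hive.sql.query.split", "true")).lower() == "false":
        return "none"
    has_column = "partitionColumn" in props
    has_count = "numPartitions" in props
    if has_column and has_count:
        return "interval"        # items 1 and 3
    if has_count:
        return "limit_offset"    # item 5
    return "none"                # item 2


def interval_splits(lower, upper, num_partitions):
    """Cut [lower, upper] into equal-width ranges, ignoring data distribution.

    Returns (low, high) pairs. The last pair ends exactly at `upper`,
    an assumption so that the maximum value is not lost to rounding.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    width = (upper - lower) / num_partitions
    splits = []
    for i in range(num_partitions):
        low = lower + i * width
        high = upper if i == num_partitions - 1 else lower + (i + 1) * width
        splits.append((low, high))
    return splits
```

For example, with lowerBound 0, upperBound 100 and numPartitions 4, `interval_splits(0, 100, 4)` yields four ranges of width 25. Because the ranges are equal-width, a skewed column still produces unevenly sized splits, as item 3 acknowledges.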

