hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Laszlo Bodor (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-21940) Metastore: Postgres text <-> clob mismatch for PARTITION_PARAMS/PARAM_VALUE
Date Mon, 01 Jul 2019 12:27:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Laszlo Bodor updated HIVE-21940:
--------------------------------
    Description: 
The issue is reproducible on a cluster with postgres metastore db by the following statements:
{code}
USE default;
drop table if exists my_table;
create external table my_table (col1 int, col3 int) partitioned by (col2 string) STORED AS
TEXTFILE;
insert into my_table VALUES(11,201,"F");
SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id = pp.part_id join
sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name = "my_table";
{code}

sys query results in:
{code}
+-------------+------------------------+-----------------+
| pp.part_id  |      pp.param_key      | pp.param_value  |
+-------------+------------------------+-----------------+
| 151         | rawDataSize            | 28629           |
| 151         | numRows                | 28628           |
| 151         | transient_lastDdlTime  | 28627           |
| 151         | COLUMN_STATS_ACCURATE  | 28626           |
| 151         | numFiles               | 28625           |
| 151         | totalSize              | 28622           |
+-------------+------------------------+-----------------+
{code}


Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience while using PARTITION_PARAMS/PARAM_VALUE,
because in postgres there is no such type as CLOB, and metastore simply saves large object
ids into this field. More interesting is that the large object can be resolved in some codepaths.
In case of a describe for partition it works correctly:
{code}
describe formatted my_table_for_sqoop partition (col2='F');

...

| Partition Parameters:             | NULL                                               |
NULL                                               |
|                                   | COLUMN_STATS_ACCURATE                              |
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col3\":\"true\"}} |
|                                   | numFiles                                           |
1                                                  |
|                                   | numRows                                            |
1                                                  |
|                                   | rawDataSize                                        |
6                                                  |
|                                   | totalSize                                          |
7                                                  |
|                                   | transient_lastDdlTime                              |
1561976024                                         |
|                                   | NULL                                               |
NULL                                               |

{code}

But in case of a direct metastore query (from hive's sys schema, but the same result for direct
postgres), it shows the result above (see sys query output). This is an issue when hive treats
these ids as they were real values, but they are obviously not correct, and this causes various
failures (e.g. using serde parameter serialization.format=28392)



param_value values above are large object ids, according to pg_dump
| 151         | COLUMN_STATS_ACCURATE  | 28626           |

{code}
SELECT pg_catalog.lo_open('28626', 131072);
SELECT pg_catalog.lowrite(0, '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
SELECT pg_catalog.lo_close(0);

{code}
decoded large object value:
{code}
{"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
{code}


  was:
The issue is reproducible on a cluster with postgres metastore db by the following statements:
{code}
USE default;
drop table if exists my_table;
create external table my_table (col1 int, col3 int) partitioned by (col2 string) STORED AS
TEXTFILE;
insert into my_table VALUES(11,201,"F");
SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id = pp.part_id join
sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name = "my_table";
{code}

sys query results in:
{code}
+-------------+------------------------+-----------------+
| pp.part_id  |      pp.param_key      | pp.param_value  |
+-------------+------------------------+-----------------+
| 151         | rawDataSize            | 28629           |
| 151         | numRows                | 28628           |
| 151         | transient_lastDdlTime  | 28627           |
| 151         | COLUMN_STATS_ACCURATE  | 28626           |
| 151         | numFiles               | 28625           |
| 151         | totalSize              | 28622           |
+-------------+------------------------+-----------------+
{code}


Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience while using PARTITION_PARAMS/PARAM_VALUE,
because in postgres there is no such type as CLOB, and metastore simply saves large object
ids into this field. More interesting is that the large object can be resolved in some codepaths.
In case of a describe for partition it works correctly:
{code}
describe formatted my_table_for_sqoop partition (col2='F');

...

| Partition Parameters:             | NULL                                               |
NULL                                               |
|                                   | COLUMN_STATS_ACCURATE                              |
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col3\":\"true\"}} |
|                                   | numFiles                                           |
1                                                  |
|                                   | numRows                                            |
1                                                  |
|                                   | rawDataSize                                        |
6                                                  |
|                                   | totalSize                                          |
7                                                  |
|                                   | transient_lastDdlTime                              |
1561976024                                         |
|                                   | NULL                                               |
NULL                                               |

{code}

But in case for direct metastore query (from hive's sys schema, but the same result for direct
postgres), it shows the result above. This is an issue when hive treats these ids as is, but
they are obviously not correct, and this causes various failures (e.g. using serde parameter
serialization.format=28392)



param_value values above are large object ids, according to pg_dump
| 151         | COLUMN_STATS_ACCURATE  | 28626           |

{code}
SELECT pg_catalog.lo_open('28626', 131072);
SELECT pg_catalog.lowrite(0, '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
SELECT pg_catalog.lo_close(0);

{code}
decoded large object value:
{code}
{"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
{code}



> Metastore: Postgres text <-> clob mismatch for PARTITION_PARAMS/PARAM_VALUE
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-21940
>                 URL: https://issues.apache.org/jira/browse/HIVE-21940
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Laszlo Bodor
>            Assignee: Laszlo Bodor
>            Priority: Major
>             Fix For: 4.0.0
>
>
> The issue is reproducible on a cluster with postgres metastore db by the following statements:
> {code}
> USE default;
> drop table if exists my_table;
> create external table my_table (col1 int, col3 int) partitioned by (col2 string) STORED
AS TEXTFILE;
> insert into my_table VALUES(11,201,"F");
> SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id = pp.part_id
join sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name = "my_table";
> {code}
> sys query results in:
> {code}
> +-------------+------------------------+-----------------+
> | pp.part_id  |      pp.param_key      | pp.param_value  |
> +-------------+------------------------+-----------------+
> | 151         | rawDataSize            | 28629           |
> | 151         | numRows                | 28628           |
> | 151         | transient_lastDdlTime  | 28627           |
> | 151         | COLUMN_STATS_ACCURATE  | 28626           |
> | 151         | numFiles               | 28625           |
> | 151         | totalSize              | 28622           |
> +-------------+------------------------+-----------------+
> {code}
> Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience while using
PARTITION_PARAMS/PARAM_VALUE, because in postgres there is no such type as CLOB, and metastore
simply saves large object ids into this field. More interesting is that the large object can
be resolved in some codepaths. In case of a describe for partition it works correctly:
> {code}
> describe formatted my_table_for_sqoop partition (col2='F');
> ...
> | Partition Parameters:             | NULL                                          
    | NULL                                               |
> |                                   | COLUMN_STATS_ACCURATE                         
    | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col3\":\"true\"}} |
> |                                   | numFiles                                      
    | 1                                                  |
> |                                   | numRows                                       
    | 1                                                  |
> |                                   | rawDataSize                                   
    | 6                                                  |
> |                                   | totalSize                                     
    | 7                                                  |
> |                                   | transient_lastDdlTime                         
    | 1561976024                                         |
> |                                   | NULL                                          
    | NULL                                               |
> {code}
> But in case of a direct metastore query (from hive's sys schema, but the same result
for direct postgres), it shows the result above (see sys query output). This is an issue when
hive treats these ids as they were real values, but they are obviously not correct, and this
causes various failures (e.g. using serde parameter serialization.format=28392)
> param_value values above are large object ids, according to pg_dump
> | 151         | COLUMN_STATS_ACCURATE  | 28626           |
> {code}
> SELECT pg_catalog.lo_open('28626', 131072);
> SELECT pg_catalog.lowrite(0, '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
> SELECT pg_catalog.lo_close(0);
> {code}
> decoded large object value:
> {code}
> {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message