hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format
Date Sun, 23 Sep 2018 10:33:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625059#comment-16625059
] 

Hive QA commented on HIVE-20523:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12940906/HIVE-20523.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 96 failed/errored test(s), 14994 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_map_emptynullvals] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_null_element] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
(batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_create] (batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_decimal1] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_null] (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_arrays_of_ints] (batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_maps] (batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization] (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_nested_complex] (batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] (batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_read_backward_compatible_files]
(batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_schema_evolution] (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_type_promotion] (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
(batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization] (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11] (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12] (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13] (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14] (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15] (batchId=92)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16] (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] (batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] (batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] (batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0] (batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit] (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_nested_udf] (batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_not] (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit]
(batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part] (batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
(batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_varchar]
(batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_pushdown] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection] (batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types] (batchId=71)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_types] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[resourceplan] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[strict_managed_tables_sysdb]
(batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time]
(batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorization_input_format_excludes]
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_parquet] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
(batchId=187)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] (batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_0] (batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_10] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_11] (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_12] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_13] (batchId=133)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_14] (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_15] (batchId=149)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_16] (batchId=147)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_17] (batchId=122)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_1] (batchId=113)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_2] (batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_3] (batchId=144)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_4] (batchId=129)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_5] (batchId=141)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_6] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_7] (batchId=148)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_8] (batchId=115)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_9] (batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_decimal_date]
(batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_div0] (batchId=145)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_limit] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_nested_udf]
(batchId=130)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_not] (batchId=145)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_offset_limit]
(batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part] (batchId=142)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_project]
(batchId=125)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_varchar]
(batchId=142)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_pushdown]
(batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
(batchId=130)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_parquet_projection]
(batchId=129)
org.apache.hadoop.hive.ql.io.parquet.TestParquetSerDe.testParquetHiveSerDe (batchId=286)
org.apache.hive.hcatalog.pig.TestParquetHCatStorer.testStoreFuncAllSimpleTypes (batchId=206)
org.apache.hive.hcatalog.pig.TestParquetHCatStorer.testWriteChar (batchId=206)
org.apache.hive.hcatalog.pig.TestParquetHCatStorer.testWriteVarchar (batchId=206)
org.apache.hive.jdbc.TestJdbcWithMiniHS2ErasureCoding.testDescribeErasureCoding (batchId=254)
org.apache.hive.jdbc.TestJdbcWithMiniHS2ErasureCoding.testExplainErasureCoding (batchId=254)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/13997/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/13997/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-13997/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 96 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12940906 - PreCommit-HIVE-Build

> Improve table statistics for Parquet format
> -------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>         Attachments: HIVE-20523.1.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with any data
type in the Parquet format is 1. This is an underestimated value when columns are complex
data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign less containers (mappers/reducers)
to it, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of the ShuffleJoin that
can fail with OOM errors.
> In this patch, I compute the columns data size better, taking into account complex structures.
I followed the Writer implementation for the ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message