hive-issues mailing list archives

From "Ashutosh Chauhan (JIRA)" <>
Subject [jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
Date Tue, 05 Dec 2017 23:02:00 GMT


Ashutosh Chauhan commented on HIVE-18149:

Since ORC and Parquet are the most common formats these days, bumping up this ratio makes sense:
columnar formats usually compress very well, and on top of that there is additional bloat in the
in-memory size.

> Stats: rownum estimation from datasize underestimates in most cases
> -------------------------------------------------------------------
>                 Key: HIVE-18149
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>         Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch
> rownum estimation is currently based on the following:
> * the datasize is taken from one of the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes; other readers are able to give
a "raw size" estimation - I've checked ORC, but I'm sure others do the same... the API docs
are a bit vague about the method's purpose...
> ** if the basicstats-level info is not available, the filesystem-level "file-size-sums"
are used as the "raw data size", which is multiplied by the [deserialization ratio|],
currently 1.
> the problem with all of this is that the deser factor is 1, while the rowsize counts the
on-heap object headers as well...
> example: 20 rows are loaded into a partition [columnstats_partlvl_dp.q|]
> after HIVE-18108 [this explain|]
will estimate the rowsize of the table to be 404 bytes; however, the 20 rows of text add up to only
169 bytes, so the estimate ends up at 0 rows...
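The arithmetic described in the issue can be sketched as follows. This is a hypothetical illustration with made-up names (`estimate_rows`, `deser_factor`), not Hive's actual implementation:

```python
# Hypothetical sketch of the rownum estimation described above; the function
# and parameter names are illustrative, not taken from the Hive codebase.

def estimate_rows(file_size_bytes: int, deser_factor: float, avg_row_size: int) -> int:
    """Estimate row count from on-disk size: scale the raw file size by the
    deserialization factor, then divide by the average on-heap row size."""
    raw_data_size = int(file_size_bytes * deser_factor)
    return raw_data_size // avg_row_size

# The example from the issue: 20 rows of text occupy 169 bytes on disk,
# but the estimated on-heap row size is 404 bytes.
print(estimate_rows(169, 1.0, 404))  # deser factor 1 -> 0 rows
print(estimate_rows(169, 5.0, 404))  # a larger factor -> 2 rows
```

With the deserialization factor left at 1, any table whose on-disk bytes per row are smaller than the estimated on-heap row size rounds down to 0 rows, which is the underestimation this issue describes.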

This message was sent by Atlassian JIRA
