hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-17182) Invalid statistics like "RAW DATA SIZE" info for parquet file
Date Thu, 28 Sep 2017 07:42:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

liyunzhang_intel updated HIVE-17182:
------------------------------------
    Description: 
On the store_sales table at TPC-DS 200 GB scale, use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name            	data_type           	comment             
	 	 
ss_sold_time_sk     	bigint              	                    
ss_item_sk          	bigint              	                    
ss_customer_sk      	bigint              	                    
ss_cdemo_sk         	bigint              	                    
ss_hdemo_sk         	bigint              	                    
ss_addr_sk          	bigint              	                    
ss_store_sk         	bigint              	                    
ss_promo_sk         	bigint              	                    
ss_ticket_number    	bigint              	                    
ss_quantity         	int                 	                    
ss_wholesale_cost   	double              	                    
ss_list_price       	double              	                    
ss_sales_price      	double              	                    
ss_ext_discount_amt 	double              	                    
ss_ext_sales_price  	double              	                    
ss_ext_wholesale_cost	double              	                    
ss_ext_list_price   	double              	                    
ss_ext_tax          	double              	                    
ss_coupon_amt       	double              	                    
ss_net_paid         	double              	                    
ss_net_paid_inc_tax 	double              	                    
ss_net_profit       	double              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
ss_sold_date_sk     	bigint              	                    
	 	 
# Detailed Table Information	 	 
Database:           	tpcds_bin_partitioned_parquet_200	 
Owner:              	root                	 
CreateTime:         	Tue Jun 06 11:51:48 CST 2017	 
LastAccessTime:     	UNKNOWN             	 
Retention:          	0                   	 
Location:           	hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\"}
	numFiles            	2023                
	numPartitions       	1824                
	numRows             	575995635           
	rawDataSize         	12671903970         
	totalSize           	46465926745         
	transient_lastDdlTime	1496721108          
{code}
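The same table parameters can also be read without the full describe output, e.g. with the standard SHOW TBLPROPERTIES command (shown here only as an alternative way to check the numbers):
{code}
-- all table parameters
SHOW TBLPROPERTIES store_sales;
-- a single parameter, e.g. the raw data size estimate
SHOW TBLPROPERTIES store_sales('rawDataSize');
{code}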
The rawDataSize is nearly 12 GB while the totalSize is nearly 46 GB.
View the original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat} 
View the Parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}

Since rawDataSize is supposed to be the size of the raw (deserialized) data, it should be close to the ~75 GB of original text data, yet the "describe formatted store_sales" output shows only about 12 GB.
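For what it's worth, the reported rawDataSize matches numRows times the number of non-partition columns exactly, which suggests the Parquet path is estimating the raw size as one unit per column per row rather than from the actual decoded data (an observation from the numbers above, not a confirmed code path):
{noformat}
575995635 rows x 22 columns = 12671903970   (= rawDataSize)
original text data = 75.8 G,  Parquet files on disk = 43.3 G
{noformat}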


I tried "analyze table store_sales compute statistics for columns" to update the statistics, but rawDataSize did not change.
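For reference, a rough sketch of the ANALYZE variants involved and which statistics each one is expected to refresh (my understanding of the behavior, worth double-checking for the Hive version in use):
{code}
-- full scan: recomputes the basic stats (numRows, rawDataSize, totalSize, numFiles)
ANALYZE TABLE store_sales PARTITION(ss_sold_date_sk) COMPUTE STATISTICS;

-- column statistics (NDV, min/max, null counts); does not touch rawDataSize
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;

-- metadata-only: refreshes numFiles/totalSize without scanning rows; for Parquet this
-- build still opens a record reader per file, which is where the error below comes from
ANALYZE TABLE store_sales PARTITION(ss_sold_date_sk) COMPUTE STATISTICS NOSCAN;
{code}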

I also tried "analyze table store_sales partition(ss_sold_date_sk) compute statistics noscan" to update the statistics, but it fails with the following error:
{code}
2017-09-28T03:21:04,849  INFO [StatsNoJobTask-Thread-1] exec.Task: [Warning] could not update stats for tpcds_bin_partitioned_parquet_10.store_sales{ss_sold_date_sk=2451769}.Failed with exception Missing timezone id for parquet int96 conversion!
java.lang.IllegalArgumentException: Missing timezone id for parquet int96 conversion!
     at org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils.validateTimeZone(NanoTimeUtils.java:169)
     at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setTimeZoneConversion(ParquetRecordReaderBase.java:182)
     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:89)
     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:59)
     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:86)
     at org.apache.hadoop.hive.ql.exec.StatsNoJobTask$StatsCollection.run(StatsNoJobTask.java:164)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)
{code}
The underlying bug is tracked by HIVE-12767.
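Assuming the build includes the table property introduced by HIVE-12767 (parquet.mr.int96.write.zone), one possible workaround to unblock the noscan path might be to give the int96 conversion an explicit timezone, e.g. (property name taken from HIVE-12767; not verified against this build):
{code}
-- hypothetical workaround: declare the zone the int96 timestamps were written in
ALTER TABLE store_sales SET TBLPROPERTIES ('parquet.mr.int96.write.zone'='UTC');
{code}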


> Invalid statistics like "RAW DATA SIZE" info for parquet file
> -------------------------------------------------------------
>
>                 Key: HIVE-17182
>                 URL: https://issues.apache.org/jira/browse/HIVE-17182
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
