spark-user mailing list archives

From emlyn <Emlyn.Cor...@microsoft.com>
Subject Re: CBO not working for Parquet Files
Date Thu, 06 Sep 2018 09:56:40 GMT
rajat mishra wrote
> When I try to compute the statistics for a query where the partition column
> is in the where clause, the statistics returned contain only sizeInBytes
> and not the row count.

We are having the same issue. Our data is in partitioned Parquet files and we
were hoping to try out the cost-based optimizer (CBO), but haven’t been able
to get it working: any query with a where clause on the partition column(s)
(which is the majority of realistic queries) seems to lose or ignore the
rowCount stats. We’ve generated both overall table stats (ANALYZE TABLE
db.table COMPUTE STATISTICS;) and partition-level stats (ANALYZE TABLE
db.table PARTITION (col1, col2) COMPUTE STATISTICS;), and have verified that
they are present in the metastore.
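
For reference, here is a minimal Python (PySpark) sketch of the steps
described above. The helper names analyze_statements and show_plan_stats are
hypothetical, introduced only for illustration, and db.table, col1 and col2
are the placeholders used in this message. It assumes Spark 2.3 or later,
where EXPLAIN COST is available, and is not a definitive reproduction.

```python
def analyze_statements(table, partition_cols=()):
    """Build the ANALYZE TABLE statements for table-level and
    partition-level statistics (hypothetical helper)."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if partition_cols:
        spec = ", ".join(partition_cols)
        statements.append(
            f"ANALYZE TABLE {table} PARTITION ({spec}) COMPUTE STATISTICS"
        )
    return statements


def show_plan_stats(spark, query):
    """Print the optimized plan with its statistics.

    Assumes `spark` is a live SparkSession; this function is not
    called below, so the snippet runs without a Spark installation.
    """
    spark.conf.set("spark.sql.cbo.enabled", "true")
    for row in spark.sql(f"EXPLAIN COST {query}").collect():
        print(row[0])


# Print the statements that would be passed to spark.sql(...).
for stmt in analyze_statements("db.table", ["col1", "col2"]):
    print(stmt)
```

With a SparkSession available, each statement can be run through
spark.sql(stmt), followed by show_plan_stats(spark, "<query filtering on the
partition column>"). In the behaviour described here, the printed plan shows
sizeInBytes but no rowCount.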
 
I’ve also found this ticket:
https://issues.apache.org/jira/browse/SPARK-25185, but it has had no
response so far.
 
I suspect we must be missing something, as it seems that partitioned parquet
files would be a common use case, and if this is a bug in Spark I would have
expected it to have been picked up sooner.
 
Has anybody managed to get CBO working with partitioned Parquet files? Is
this a known issue?
 
Thanks,
Emlyn




