spark-user mailing list archives

From Hao Ren <>
Subject Spark SQL reads all leaf directories on a partitioned Hive table
Date Wed, 07 Aug 2019 14:27:39 GMT
I am using Spark SQL 2.3.3 to read a Hive table which is partitioned by
day, hour, platform, request_status, and is_sampled. The underlying data
is in Parquet format on HDFS.
Here is the SQL query to read just *one partition*.

SELECT rtb_platform_id, SUM(e_cpm)
FROM raw_logs.fact_request
WHERE day = '2019-08-01'
AND hour = '00'
AND platform = 'US'
AND request_status = '3'
AND is_sampled = 1
GROUP BY rtb_platform_id

However, from the Spark web UI, the stage description shows:

Listing leaf files and directories for 201616 paths:

It seems the job is listing all of the partitions of the table, which
takes far too long for a query that touches just one partition. One
workaround is using the `` API to read the Parquet files directly, since
Spark is partition-aware for partitioned directories.
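For reference, the workaround I have in mind looks roughly like the
following sketch. The reader API (`spark.read.parquet`) and the HDFS path
are my assumptions, since the exact API and layout are not shown above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fact_request_one_partition")
  .getOrCreate()

// Point the reader directly at the partition directory (hypothetical
// path), so only that subtree is listed instead of all 201616 leaf paths.
val df = spark.read.parquet(
  "hdfs:///path/to/fact_request/day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1")

// Same aggregation as the SQL query above.
df.groupBy("rtb_platform_id")
  .sum("e_cpm")
  .show()
```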

But still, I would like to know whether there is a way to leverage
partition-awareness via Hive when using the `spark.sql` API.

Any help is highly appreciated!

Thank you.

Hao Ren
