spark-dev mailing list archives

From Hao Ren <inv...@gmail.com>
Subject Spark SQL reads all leaf directories on a partitioned Hive table
Date Wed, 07 Aug 2019 11:52:48 GMT
Hi,
I am using Spark SQL 2.3.3 to read a Hive table that is partitioned by
day, hour, platform, request_status, and is_sampled. The underlying data is
stored as Parquet on HDFS.
Here is the SQL query to read just *one partition*:

```
spark.sql("""
SELECT rtb_platform_id, SUM(e_cpm)
FROM raw_logs.fact_request
WHERE day = '2019-08-01'
AND hour = '00'
AND platform = 'US'
AND request_status = '3'
AND is_sampled = 1
GROUP BY rtb_platform_id
""").show
```

However, from the Spark web UI, the stage description shows:

```
Listing leaf files and directories for 201616 paths:
viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
...
```

It seems the job is listing every partition of the table, which takes far
too long when only one partition is needed. One workaround is to use the
`spark.read.parquet` API to read the Parquet files directly, since Spark
is partition-aware for partitioned directory layouts.
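For reference, a minimal sketch of that workaround (the table root path here is inferred from the UI output above and may need adjusting; the filter values match the query):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

// Read the partitioned Parquet directory directly. Spark discovers the
// partition columns (day, hour, ...) from the directory names and prunes
// non-matching partitions before listing their files.
val df = spark.read
  .parquet("viewfs://root/user/bilogs/logs/fact_request")
  .where($"day" === "2019-08-01" && $"hour" === "00" &&
         $"platform" === "US" && $"request_status" === "3" &&
         $"is_sampled" === 1)
  .groupBy($"rtb_platform_id")
  .agg(sum($"e_cpm"))

df.show()
```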

Still, I would like to know: is there a way to get the same partition
pruning through the Hive metastore when using the `spark.sql` API?
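In case it is useful context, these are the settings I have been looking at (they exist in Spark 2.3; I am not sure which one explains the behaviour above, so treat the comments as assumptions):

```scala
// Push partition-column filters down to the Hive metastore,
// so only matching partitions are listed (default: true).
spark.conf.get("spark.sql.hive.metastorePartitionPruning")

// Track partitions of converted file-source tables in the metastore
// instead of listing all leaf directories up front (default: true).
spark.conf.get("spark.sql.hive.manageFilesourcePartitions")

// Use Spark's native Parquet reader for Hive Parquet tables
// (default: true).
spark.conf.get("spark.sql.hive.convertMetastoreParquet")
```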

Any help is highly appreciated!

Thank you.

-- 
Hao Ren
