spark-user mailing list archives

From: Hao Ren <inv...@gmail.com>
Subject: Fwd: Spark SQL reads all leaf directories on a partitioned Hive table
Date: Thu, 08 Aug 2019 14:16:23 GMT
---------- Forwarded message ---------
From: Hao Ren <invkrh@gmail.com>
Date: Thu, Aug 8, 2019 at 4:15 PM
Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive table
To: Gourav Sengupta <gourav.sengupta@gmail.com>


Hi Gourav,

I am using enableHiveSupport.
The table was not created by Spark; it already exists in Hive. All I did
was read it with a SQL query in Spark.
FYI, I put hive-site.xml in the spark/conf/ directory to make sure that
Spark can access Hive.
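
Concretely, the session is created along these lines (a minimal sketch;
the app name is illustrative):

```
import org.apache.spark.sql.SparkSession

// enableHiveSupport makes Spark use the Hive metastore configured
// in spark/conf/hive-site.xml as its catalog.
val spark = SparkSession.builder()
  .appName("fact-request-query") // illustrative name
  .enableHiveSupport()
  .getOrCreate()
```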

Hao

On Thu, Aug 8, 2019 at 1:24 PM Gourav Sengupta <gourav.sengupta@gmail.com>
wrote:

> Hi,
>
> Just out of curiosity did you start the SPARK session using
> enableHiveSupport() ?
>
> Or are you creating the table using SPARK?
>
>
> Regards,
> Gourav
>
> On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <invkrh@gmail.com> wrote:
>
>> Hi,
>> I am using Spark SQL 2.3.3 to read a Hive table that is partitioned by
>> day, hour, platform, request_status, and is_sampled. The underlying data
>> is in Parquet format on HDFS.
>> Here is the SQL query to read just *one partition*.
>>
>> ```
>> spark.sql("""
>> SELECT rtb_platform_id, SUM(e_cpm)
>> FROM raw_logs.fact_request
>> WHERE day = '2019-08-01'
>> AND hour = '00'
>> AND platform = 'US'
>> AND request_status = '3'
>> AND is_sampled = 1
>> GROUP BY rtb_platform_id
>> """).show
>> ```
>>
>> However, from the Spark web UI, the stage description shows:
>>
>> ```
>> Listing leaf files and directories for 201616 paths:
>> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
>> ...
>> ```
>>
>> It seems the job is listing all of the partitions of the table, and the
>> job takes too long for just one partition. One workaround is to use the
>> `spark.read.parquet` API to read the Parquet files directly, since Spark
>> has partition awareness for partitioned directories (see the sketch
>> below).
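>>
>> As a minimal sketch of that workaround (the base path is inferred from
>> the UI message above and the filter columns from the query, so treat
>> both as assumptions):
>>
>> ```
>> import org.apache.spark.sql.functions.{col, sum}
>>
>> // Reading the table root lets Spark discover the partition columns
>> // (day, hour, ...) from the directory names and prune to one partition.
>> spark.read.parquet("viewfs://root/user/bilogs/logs/fact_request")
>>   .filter(col("day") === "2019-08-01" && col("hour") === "00" &&
>>           col("platform") === "US" && col("request_status") === "3" &&
>>           col("is_sampled") === 1)
>>   .groupBy("rtb_platform_id")
>>   .agg(sum("e_cpm"))
>>   .show()
>> ```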
>>
>> But still, is there a way to leverage this partition awareness via Hive
>> when using the `spark.sql` API?
>>
>> Any help is highly appreciated!
>>
>> Thank you.
>>
>> --
>> Hao Ren
>>
>

-- 
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France

