spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hao Ren <inv...@gmail.com>
Subject Re: Spark SQL reads all leaf directories on a partitioned Hive table
Date Wed, 14 Aug 2019 10:07:41 GMT
Thank you, Subash. It works!

On Tue, Aug 13, 2019 at 5:58 AM Subash Prabakar <subashprabakar@gmail.com>
wrote:

> I had the similar issue reading the external parquet table . In my case I
> had permission issue in one partition so I added filter to exclude that
> partition but still the spark didn’t prune it. Then I read that in order
> for spark to be aware of all the partitions it first read the folders and
> then updated its metastore . Then the sql is applied on TOP of it. Instead
> of using the existing hive SerDe and this property is only for parquet
> files.
>
> Hive metastore Parquet table conversion
> <https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion>
>
> When reading from and writing to Hive metastore Parquet tables, Spark SQL
> will try to use its own Parquet support instead of Hive SerDe for better
> performance. This behavior is controlled by the
> spark.sql.hive.convertMetastoreParquetconfiguration, and is turned on by
> default.
>
> Reference:
> https://spark.apache.org/docs/2.3.0/sql-programming-guide.html
>
> Set the above property to false . It should work.
>
> If anyone have better explanation please let me know - I have same
> question. Why only parquet has this problem ?
>
> Thanks
> Subash
>
> On Fri, 9 Aug 2019 at 16:18, Hao Ren <invkrh@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for your reply.
>> I need to be more clear about the environment. I am using spark-shell to
>> run the query.
>> Actually, the query works even without core-site, hdfs-site being under
>> $SPARK_HOME/conf.
>> My problem is efficiency. Because all of the partitions was scanned
>> instead of the one in question during the execution of the spark sql query.
>> This is why this simple query takes too much time.
>> I would like to know how to improve this by just reading the specific
>> partition in question.
>>
>> Feel free to ask more questions if I am not clear.
>>
>> Best regards,
>> Hao
>>
>> On Thu, Aug 8, 2019 at 9:05 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> also need others as well using soft link ls -l
>>>
>>> cd $SPARK_HOME/conf
>>>
>>> hive-site.xml -> ${HIVE_HOME/conf/hive-site.xml
>>> core-site.xml -> ${HADOOP_HOME}/etc/hadoop/core-site.xml
>>> hdfs-site.xml -> ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 8 Aug 2019 at 15:16, Hao Ren <invkrh@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Hao Ren <invkrh@gmail.com>
>>>> Date: Thu, Aug 8, 2019 at 4:15 PM
>>>> Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive
>>>> table
>>>> To: Gourav Sengupta <gourav.sengupta@gmail.com>
>>>>
>>>>
>>>> Hi Gourva,
>>>>
>>>> I am using enableHiveSupport.
>>>> The table was not created by Spark. The table already exists in Hive.
>>>> All I did is just reading it by using SQL query in Spark.
>>>> FYI, I put hive-site.xml in spark/conf/ directory to make sure that
>>>> Spark can access to Hive.
>>>>
>>>> Hao
>>>>
>>>> On Thu, Aug 8, 2019 at 1:24 PM Gourav Sengupta <
>>>> gourav.sengupta@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Just out of curiosity did you start the SPARK session using
>>>>> enableHiveSupport() ?
>>>>>
>>>>> Or are you creating the table using SPARK?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <invkrh@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am using Spark SQL 2.3.3 to read a hive table which is partitioned
>>>>>> by day, hour, platform, request_status and is_sampled. The underlying
data
>>>>>> is in parquet format on HDFS.
>>>>>> Here is the SQL query to read just *one partition*.
>>>>>>
>>>>>> ```
>>>>>> spark.sql("""
>>>>>> SELECT rtb_platform_id, SUM(e_cpm)
>>>>>> FROM raw_logs.fact_request
>>>>>> WHERE day = '2019-08-01'
>>>>>> AND hour = '00'
>>>>>> AND platform = 'US'
>>>>>> AND request_status = '3'
>>>>>> AND is_sampled = 1
>>>>>> GROUP BY rtb_platform_id
>>>>>> """).show
>>>>>> ```
>>>>>>
>>>>>> However, from the Spark web UI, the stage description shows:
>>>>>>
>>>>>> ```
>>>>>> Listing leaf files and directories for 201616 paths:
>>>>>> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
>>>>>> ...
>>>>>> ```
>>>>>>
>>>>>> It seems the job is reading all of the partitions of the table and
>>>>>> the job takes too long for just one partition. One workaround is
using
>>>>>> `spark.read.parquet` API to read parquet files directly. Spark has
>>>>>> partition-awareness for partitioned directories.
>>>>>>
>>>>>> But still, I would like to know if there is a way to leverage
>>>>>> partition-awareness via Hive by using `spark.sql` API?
>>>>>>
>>>>>> Any help is highly appreciated!
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> --
>>>>>> Hao Ren
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Hao Ren
>>>>
>>>> Software Engineer in Machine Learning @ Criteo
>>>>
>>>> Paris, France
>>>>
>>>>
>>>> --
>>>> Hao Ren
>>>>
>>>> Software Engineer in Machine Learning @ Criteo
>>>>
>>>> Paris, France
>>>>
>>>
>>
>> --
>> Hao Ren
>>
>> Software Engineer in Machine Learning @ Criteo
>>
>> Paris, France
>>
>

-- 
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France

Mime
View raw message