spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Hive external table not working in sparkSQL when subdirectories are present
Date Wed, 07 Aug 2019 07:13:54 GMT
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share
some code?
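[For readers landing on this thread later: a minimal sketch of the Spark-side settings commonly suggested for this situation. The Hadoop/Hive property names and the `spark.sql.hive.convertMetastoreOrc` option are standard, but whether they resolve the problem on Spark 2.3.2 with Hive 3.1.0 would need verifying on an actual cluster; the table name `ExtTable` is taken from the question below.]

```scala
import org.apache.spark.sql.SparkSession

// Build a session with Hive support so Spark resolves the table via the
// metastore, and disable Spark's native ORC conversion so the Hive SerDe
// path (which honors more Hive table properties) is used instead.
val spark = SparkSession.builder()
  .appName("subdir-read-sketch")
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .getOrCreate()

// Hadoop-level switches that allow the input format to descend into
// subdirectories, mirroring what was set on the Hive side.
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("SET mapred.input.dir.recursive=true")
spark.sql("SET hive.mapred.supports.subdirectories=true")

spark.sql("SELECT COUNT(*) FROM ExtTable").show()
```

As a workaround that bypasses the metastore entirely, the ORC files can also be read directly with a glob over the subdirectories, e.g. `spark.read.orc("/path/to/A/*/")`, at the cost of losing the table's schema-on-read definition.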

> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade <rishikeshg1996@gmail.com>:
> 
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0. 
> Even if I use Parquet files the result would be the same, because after all sparkSQL isn't
> able to descend into the subdirectories over which the table is created. Could there be any
> other way?
> Thanks,
> Rishikesh
> 
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> Which versions of Spark and Hive are you using?
>> 
>> What will happen if you use Parquet tables instead?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
>> or destruction of data or any other property which may arise from relying on this email's
>> technical content is explicitly disclaimed. The author will in no case be liable for any
>> monetary damages arising from such loss, damage or destruction.
>>  
>> 
>> 
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <rishikeshg1996@gmail.com> wrote:
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has data stored
>>> in ORC format. This directory has several subdirectories inside it, each of which
>>> contains the actual ORC files.
>>> These subdirectories are actually created by Spark jobs which ingest data from
>>> other sources and write it into this directory.
>>> I tried creating a table and setting its table properties as
>>> hive.mapred.supports.subdirectories=TRUE and mapred.input.dir.recursive=TRUE.
>>> As a result, when I fire the simplest query, select count(*) from ExtTable,
>>> via the Hive CLI, it successfully gives me the expected count of records in the table.
>>> However, when I fire the same query via sparkSQL, I get count = 0.
>>> 
>>> I think sparkSQL isn't able to descend into the subdirectories to get
>>> the data, while Hive is able to do so.
>>> Are there any configurations that need to be set on the Spark side so that this
>>> works as it does via the Hive CLI?
>>> I am using Spark on YARN.
>>> 
>>> Thanks,
>>> Rishikesh
>>> 
