spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Spark hive overwrite is very very slow
Date Sun, 20 Aug 2017 17:46:25 GMT
Ah, I see. Then I would also check directly in Hive whether you have issues inserting data into the
Hive table. Alternatively, you can register the df as a temp table and do an insert into
the Hive table from the temp table using Spark SQL ("insert into table hivetable select * from
temptable").


You seem to be using Cloudera, so you probably have a very outdated Hive version. You could switch
to a distribution that ships a recent version of Hive 2 with Tez+LLAP; these are much more performant
and offer many more features.


> On 20. Aug 2017, at 18:47, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
> 
> Hi,
> 
> I have created the Hive table in Impala first with Parquet as the storage format. With a DataFrame from Spark I am trying to insert into the same table using the syntax below.
> 
> The table is partitioned by year, month, day.
> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
> 
> https://issues.apache.org/jira/browse/SPARK-20049
> 
> I saw something in the above link; not sure if it is the same thing in my case.
> 
> Thanks,
> Asmath
> 
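
A hedged sketch (Spark 2.x, Scala) of the partitioned overwrite above with Hive dynamic partitioning enabled; the two settings are standard Hive options, though whether they address the SPARK-20049-style slowness here is an assumption:

    import org.apache.spark.sql.SaveMode

    // Allow dynamic partitions so the overwrite only touches the
    // partitions that actually receive data.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    // insertInto matches columns by position; the partition columns
    // (year, month, day) must be the last columns of ds.
    ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")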
>> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>> Have you made sure that saveAsTable stores them as Parquet?
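
If in doubt, one option is to request Parquet explicitly when writing; a sketch, assuming the same ds and table as above:

    import org.apache.spark.sql.SaveMode

    // Force the Parquet format rather than relying on the default
    // (spark.sql.sources.default); note this recreates the table.
    ds.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable("db.parqut_table")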
>> 
>>> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>>> 
>>> we are using parquet tables, is it causing any performance issue?
>>> 
>>>> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> Improving the performance of Hive can also be done by switching to Tez+LLAP as the engine.
>>>> Aside from this: you need to check what the default format is that it writes to Hive. One reason for slow writes into a Hive table could be that it writes by default to csv/gzip or csv/bzip2.
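
One way to check this from Spark (a small sketch; the table name is the one mentioned earlier in the thread):

    // The InputFormat and SerDe Library rows show whether the table is
    // really stored as Parquet or as e.g. delimited text.
    spark.sql("DESCRIBE FORMATTED db.parqut_table").show(100, truncate = false)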
>>>> 
>>>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>>>> >
>>>> > Yes, we tried Hive and want to migrate to Spark for better performance. I am using Parquet tables. Still no better performance while loading.
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> >>
>>>> >> Have you tried directly in Hive to see what the performance is?
>>>> >>
>>>> >> In which format do you expect Hive to write? Have you made sure it is in this format? It could be that you are using an inefficient format (e.g. CSV + bzip2).
>>>> >>
>>>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It just pulls the data from a Hive table, adds extra columns, removes duplicates, and then writes it back to Hive again.
>>>> >>>
>>>> >>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of data. Is there anything I need to do to improve the performance?
>>>> >>>
>>>> >>> spark.sql.shuffle.partitions is 2000 in my case, with executor memory of 16 GB and dynamic allocation enabled.
>>>> >>>
>>>> >>> I am doing an insert overwrite on the partitioned table:
>>>> >>> df.write.mode(SaveMode.Overwrite).insertInto(table)
>>>> >>>
>>>> >>> Any suggestions please ??
>>>> >>>
>>>> >>> Sent from my iPhone
>>>> >>> ---------------------------------------------------------------------
>>>> >>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>> >>>
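
For reference, a minimal sketch of the job described above (Spark 2.0, Scala); all table and column names are illustrative assumptions, not taken from the thread:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._

    // Pull from Hive, add an extra column, drop duplicates, and write
    // back with an overwrite of the target table's partitions.
    val df = spark.table("db.source_table")
      .withColumn("load_ts", current_timestamp())  // illustrative extra column
      .dropDuplicates()

    df.write.mode(SaveMode.Overwrite).insertInto("db.target_table")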
>>> 
> 
