Have you made sure that saveAsTable stores them as Parquet?
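
A quick way to verify, from a spark-shell with Hive support (the table name is illustrative); the Storage section of the output shows the SerDe and Input-/OutputFormat actually in use:

spark.sql("DESCRIBE FORMATTED my_database.my_table").show(100, truncate = false)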

On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:

We are using Parquet tables. Is that causing any performance issue?

On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfranke@gmail.com> wrote:
Hive performance can also be improved by switching to Tez+LLAP as the execution engine.
Aside from this: you need to check which format Spark writes to Hive by default. One reason for slow writes into a Hive table could be that it writes by default to CSV/gzip or CSV/bzip2.
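
If the default turns out to be a text format, the quickest fix is to request Parquet explicitly on the write side. A minimal sketch, assuming a DataFrame df (the table name is illustrative):

df.write
  .mode("overwrite")
  .format("parquet")   // don't rely on the session's default data source
  .saveAsTable("my_database.my_table")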

> On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>
> Yes, we tried Hive and want to migrate to Spark for better performance. I am using Parquet tables, but still see no better performance while loading.
>
> Sent from my iPhone
>
>> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>
>> Have you tried the same job directly in Hive to see how the performance compares?
>>
>> In which format do you expect Hive to write? Have you made sure it is actually in that format? It could be that you are using an inefficient format (e.g. CSV + bzip2).
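>>
>> One way to make the format explicit is to declare it when the table is created, so any insert afterwards writes Parquet. A rough sketch via Spark SQL (database, table, and columns are illustrative):
>>
>> spark.sql("""
>>   CREATE TABLE IF NOT EXISTS my_database.my_table (id BIGINT, payload STRING)
>>   PARTITIONED BY (ds STRING)
>>   STORED AS PARQUET
>> """)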
>>
>>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasmath@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have written a Spark SQL job on Spark 2.0 using Scala. It just pulls the data from a Hive table, adds extra columns, removes duplicates, and then writes it back to Hive again.
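>>>
>>> In outline, the job does the following (table and column names changed for illustration):
>>>
>>> import org.apache.spark.sql.functions.current_date
>>>
>>> val df = spark.table("source_db.source_table")
>>>   .withColumn("load_date", current_date()) // one of the extra columns, as an example
>>>   .dropDuplicates()                        // drop exact duplicate rows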
>>>
>>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of data. Is there anything I can do to improve performance?
>>>
>>> spark.sql.shuffle.partitions is 2000 in my case, with executor memory of 16 GB and dynamic allocation enabled.
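>>>
>>> For reference, those settings are applied roughly as follows (the exact submit flags depend on our setup):
>>>
>>> spark.conf.set("spark.sql.shuffle.partitions", "2000")
>>> // at submit time:
>>> //   spark-submit --executor-memory 16g --conf spark.dynamicAllocation.enabled=true ...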
>>>
>>> I am doing an insert overwrite on a partitioned table:
>>> df.write.mode("overwrite").insertInto(table)
>>>
>>> Any suggestions, please?
>>>
>>> Sent from my iPhone