spark-user mailing list archives

From Arnaud LARROQUE <alarro...@gmail.com>
Subject Re: Performance Issue
Date Sun, 13 Jan 2019 15:05:48 GMT
Hi,

Indeed, Spark uses spark.sql.autoBroadcastJoinThreshold to decide whether it
autobroadcasts a dataset or not. The default value is 10 MB.
You can run an EXPLAIN, check the different plans to see whether
BroadcastHashJoins are being used, and adjust the threshold accordingly.
There is no point in increasing it too much, as it will use too much memory
on each executor.
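As a minimal sketch of what to look for: the plan excerpt below is hypothetical and abridged; with a live session you would capture the text printed by df.explain() or by an EXPLAIN statement instead.

```python
# Sketch: detecting whether Spark chose a broadcast join by inspecting the
# physical plan text. The plan excerpt is hypothetical and abridged.

def uses_broadcast_join(plan_text: str) -> bool:
    """True if the physical plan contains a BroadcastHashJoin operator."""
    return "BroadcastHashJoin" in plan_text

plan = """
== Physical Plan ==
*(2) BroadcastHashJoin [device_id#1], [id#7], Inner, BuildRight
:- *(2) FileScan csv ...
+- BroadcastExchange HashedRelationBroadcastMode ...
"""
print(uses_broadcast_join(plan))  # True: the small side was broadcast
```

If the plan shows a SortMergeJoin on the small table instead, that is the signal to raise the threshold (or add an explicit broadcast hint).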

You could also try increasing spark.sql.shuffle.partitions to 2001 or more.
In version 2.0.x, I've tracked down that above this limit, shuffle partition
statuses are compressed, and it may help relieve some pressure on the
executors.
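As a sketch, the two settings discussed above could be expressed like this (the conf keys are real Spark settings, but the values are illustrative starting points, not tuned recommendations):

```python
# Sketch: the settings discussed above as a conf dict; you might pass these
# via SparkSession.builder.config(...) or as spark-submit --conf flags.
conf = {
    # broadcast tables up to ~50 MB instead of the 10 MB default (in bytes)
    "spark.sql.autoBroadcastJoinThreshold": str(50 * 1024 * 1024),
    # above 2000, shuffle map statuses are stored in compressed form
    "spark.sql.shuffle.partitions": "2001",
}
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(conf.items()))
print(flags)
```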

Can you tell us more about your job:
- Garbage collection pressure?
- Total number of tasks vs. number of executors (parallelism), along with the
CPU and memory allocated to each executor?

You can find all of these inputs in the Spark UI.
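As a rough back-of-the-envelope check using numbers from earlier in the thread (200 max executors, 8 cores on an m5.2xlarge node; this assumes one executor per node with all cores, which may not match the actual setup):

```python
# Rough parallelism check: how many tasks can run concurrently, and do the
# configured shuffle partitions keep every task slot busy?
def task_slots(executors: int, cores_per_executor: int) -> int:
    """Maximum number of concurrently running tasks."""
    return executors * cores_per_executor

slots = task_slots(200, 8)
print(slots)         # 1600
print(1000 < slots)  # True: 1000 shuffle partitions leave some slots idle
```

A common rule of thumb is to set shuffle partitions at 2-3x the number of task slots, which would point well above the 1000 currently configured.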

Regards,
Arnaud


On Sun, Jan 13, 2019 at 3:30 PM Tzahi File <tzahi.file@ironsrc.com> wrote:

> Hi Gourav,
>
> I tried to remove the left join to see how it influences the performance.
> It made a difference of only about 3 minutes.
> So I'm looking for a solution that decreases the running time more
> significantly (the running time is currently about 2 hours).
>
> On Sun, Jan 13, 2019 at 1:12 PM Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
>> Hi Tzahi,
>>
>> I think that SPARK automatically broadcasts with the latest versions, but
>> you might have to check with your version. Did you try filtering first and
>> then doing the LEFT JOIN?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Jan 13, 2019 at 9:20 AM Tzahi File <tzahi.file@ironsrc.com>
>> wrote:
>>
>>> Hi Gourav,
>>>
>>> I just wanted to attach an example of my query, so I replaced my field
>>> names with "select *"; I do have aggregate fields in my query.
>>>
>>> What about improving performance with Spark features - like broadcasting
>>> or something like that?
>>>
>>> Thanks,
>>> Tzahi
>>>
>>> On Thu, Jan 10, 2019 at 7:23 PM Gourav Sengupta <
>>> gourav.sengupta@gmail.com> wrote:
>>>
>>>> Hi Tzahi,
>>>>
>>>> By using GROUP BY without any aggregate columns, are you just trying to
>>>> find the DISTINCT of the columns?
>>>>
>>>> Also, it may help (I do not know whether the SQL optimiser automatically
>>>> takes care of this) to apply the LEFT JOIN to a smaller data set by doing
>>>> the join on device_id first, in a subquery or separate query, and, when
>>>> writing the output of the JOIN between csv_file and raw_e, to ORDER BY
>>>> the output based on campaign_id.
>>>>
>>>> Thanks and Regards,
>>>> Gourav Sengupta
>>>>
>>>>
>>>> On Thu, Jan 10, 2019 at 1:13 PM Tzahi File <tzahi.file@ironsrc.com>
>>>> wrote:
>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> My version of Spark is 2.1.
>>>>>
>>>>> The data is stored on S3 directory in parquet format.
>>>>>
>>>>> I sent you an example of a query I would like to run (the raw_e table
>>>>> is stored as parquet files and event_day is the partition field):
>>>>>
>>>>> SELECT *
>>>>> FROM (SELECT *
>>>>>       FROM parquet_files.raw_e
>>>>>       WHERE event_day >= '2018-11-28' AND event_day <= '2018-12-28') AS re
>>>>> JOIN csv_file AS g
>>>>>   ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
>>>>> LEFT JOIN campaigns AS c
>>>>>   ON c.campaign_id = re.campaign_id
>>>>> GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
>>>>> 19, 20, 21
>>>>>
>>>>> Looking forward to any insights.
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Wed, Jan 9, 2019 at 8:21 AM Gourav Sengupta <
>>>>> gourav.sengupta@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Can you please let us know the SPARK version, and the query, and
>>>>>> whether the data is in parquet format or not, and where is it stored?
>>>>>>
>>>>>> Regards,
>>>>>> Gourav Sengupta
>>>>>>
>>>>>> On Wed, Jan 9, 2019 at 1:53 AM 大啊 <beliefer@163.com> wrote:
>>>>>>
>>>>>>> What is your performance issue?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> At 2019-01-08 22:09:24, "Tzahi File" <tzahi.file@ironsrc.com> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have some performance issue running SQL query on Spark.
>>>>>>>
>>>>>>> The query contains one partitioned parquet table (partitioned by
>>>>>>> date), where each partition is about 200 GB, and a simple table with
>>>>>>> about 100 records. The Spark cluster is of type m5.2xlarge - 8 cores.
>>>>>>> I'm using the Qubole interface for running the SQL query.
>>>>>>>
>>>>>>> After searching for ways to improve my query, I have added the
>>>>>>> following settings to the configuration:
>>>>>>> spark.sql.shuffle.partitions=1000
>>>>>>> spark.dynamicAllocation.maxExecutors=200
>>>>>>>
>>>>>>> There wasn't any significant improvement. I'm looking for any ideas
>>>>>>> to improve my running time.
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Tzahi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Tzahi File
>>>>> Data Engineer
>>>>> ironSource <http://www.ironsrc.com/>
>>>>>
>>>>> email tzahi.file@ironsrc.com
>>>>> mobile +972-546864835
>>>>> fax +972-77-5448273
>>>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>>>> ironsrc.com <http://www.ironsrc.com/>
>>>>> This email (including any attachments) is for the sole use of the
>>>>> intended recipient and may contain confidential information which may be
>>>>> protected by legal privilege. If you are not the intended recipient, or
>>>>> the employee or agent responsible for delivering it to the intended
>>>>> recipient, you are hereby notified that any use, dissemination,
>>>>> distribution or copying of this communication and/or its content is
>>>>> strictly prohibited. If you are not the intended recipient, please
>>>>> immediately notify us by reply email or by telephone, delete this email
>>>>> and destroy any copies. Thank you.
>>>>
>>>
>>
>
