spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Performance Issue
Date Thu, 10 Jan 2019 17:22:50 GMT
Hi Tzahi,

By using GROUP BY without any aggregate columns, are you just trying to find
the DISTINCT rows of those columns?
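
For example, with a hypothetical table t and no aggregate functions in the
SELECT list, these two statements are equivalent:

SELECT col1, col2 FROM t GROUP BY 1, 2;
SELECT DISTINCT col1, col2 FROM t;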

Also, it may help (I do not know whether the SQL optimiser automatically
takes care of this) to run the LEFT JOIN against a smaller data set by doing
the join on device_id first, as a subquery or a separate query. And, when you
write out the result of the JOIN between csv_file and raw_e, ORDER BY that
output on campaign_id.
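
A rough, untested sketch of that restructuring, using the table and column
names from your query below (adjust the select list as needed):

SELECT *
FROM (SELECT re.*
      FROM parquet_files.raw_e AS re
      JOIN csv_file AS g
        ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
      WHERE re.event_day >= '2018-11-28' AND re.event_day <= '2018-12-28') AS j
LEFT JOIN campaigns AS c
  ON c.campaign_id = j.campaign_id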

Thanks and Regards,
Gourav Sengupta


On Thu, Jan 10, 2019 at 1:13 PM Tzahi File <tzahi.file@ironsrc.com> wrote:

> Hi Gourav,
>
> My version of Spark is 2.1.
>
> The data is stored in an S3 directory in parquet format.
>
> I am sending you an example of a query I would like to run (the raw_e table
> is stored as parquet files and event_day is the partition field):
>
> SELECT *
> FROM (SELECT *
>       FROM parquet_files.raw_e
>       WHERE event_day >= '2018-11-28' AND event_day <= '2018-12-28') AS re
> JOIN csv_file AS g
>   ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
> LEFT JOIN campaigns AS c
>   ON c.campaign_id = re.campaign_id
> GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
> 17, 18, 19, 20, 21
>
> Looking forward to any insights.
>
>
> Thanks.
>
> On Wed, Jan 9, 2019 at 8:21 AM Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
>> Hi,
>>
>> Can you please let us know the Spark version, the query, whether the data
>> is in parquet format or not, and where it is stored?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Jan 9, 2019 at 1:53 AM 大啊 <beliefer@163.com> wrote:
>>
>>> What is your performance issue?
>>>
>>> At 2019-01-08 22:09:24, "Tzahi File" <tzahi.file@ironsrc.com> wrote:
>>>
>>> Hello,
>>>
>>> I have a performance issue running a SQL query on Spark.
>>>
>>> The query joins one partitioned parquet table (partitioned by date), where
>>> each partition is about 200 GB, with a simple table of about 100 records.
>>> The Spark cluster is of type m5.2xlarge - 8 cores. I'm using the Qubole
>>> interface for running the SQL query.
>>>
>>> After searching for ways to improve my query, I added the following
>>> settings to the configuration:
>>> spark.sql.shuffle.partitions=1000
>>> spark.dynamicAllocation.maxExecutors=200
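>>>
>>> For reference, spark.sql.shuffle.partitions is a runtime SQL conf, so in a
>>> SQL session it can usually be changed inline:
>>>
>>> SET spark.sql.shuffle.partitions=1000;
>>>
>>> whereas spark.dynamicAllocation.maxExecutors generally has to be set in the
>>> cluster or job configuration rather than with a runtime SET.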
>>>
>>> There wasn't any significant improvement. I'm looking for any ideas to
>>> reduce the running time.
>>>
>>>
>>> Thanks!
>>> Tzahi
>>>
>>
>
> --
> Tzahi File
> Data Engineer
>
> email tzahi.file@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com <http://www.ironsrc.com/>
