spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Tasks are skewed to one executor
Date Mon, 12 Apr 2021 13:57:07 GMT
Sounds like there is mileage in salting. Other parameters may play a role
as well.

How many partitions did you use in your groupBy(account, product, salt)?

It may not be large enough.

Try increasing it: salt with a randomised, uniformly distributed integer
between 0 and bins - 1, where

from pyspark.sql.functions import rand

bins = 20   ## play around with this number

df2 = df.withColumn("salt", (rand() * bins).cast("integer"))

then

df2.groupBy("account", "product", "salt")

Try this and change bins as needed
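
In case it helps, here is a minimal end-to-end sketch of the two-stage
version (the column names account, product and metric come from your
earlier message; everything else is illustrative):

from pyspark.sql.functions import rand, sum as _sum

bins = 20  # salt cardinality; increase if one key is very hot

# df is your running totals dataframe
# stage 1: the salted groupBy spreads a hot (account, product) key
# across up to `bins` tasks
salted = df.withColumn("salt", (rand() * bins).cast("integer"))
partial = (salted.groupBy("account", "product", "salt")
                 .agg(_sum("metric").alias("metric")))

# stage 2: drop the salt and fold the partials into the final totals
result = (partial.groupBy("account", "product")
                 .agg(_sum("metric").alias("metric")))

The second stage matters: the salted groupBy alone returns up to bins rows
per (account, product) pair, so without it the totals would be wrong.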

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>







On Mon, 12 Apr 2021 at 09:44, András Kolbert <kolbertandras@gmail.com>
wrote:

> The cluster is on prem, not in the cloud.
>
> Adding the partition id to the groupBy helped slightly, but I got the most
> benefit when I took out the following property:
>
> spark.sql.autoBroadcastJoinThreshold = -1
>
> When I have this on, the process caches the big dataframe on a couple of
> executors, which makes them very vulnerable to OOM.
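>
> For reference, a one-line sketch of how this property is set on the
> session (it can equally be passed with --conf at submit time):
>
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # -1 disables auto-broadcast joins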
>
> When autoBroadcastJoin is on, the executors still periodically die. My
> assumption is that with autoBroadcastJoin on, Spark broadcasts the smaller
> dataframes to executor memory on every batch without ever deleting them
> from there, and gradually my executors die with OOM.
>
> This is after 48 minutes of runtime, with 13 dead executors.
>
> [image: image.png]
>
> [image: image.png]
>
> I tried broadcast(dataframe) manually, specifying the broadcast
> explicitly and then unpersisting it when it changes. That does not seem to
> work.
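>
> Roughly this pattern (the names and the join key are illustrative, not the
> actual code):
>
> from pyspark.sql.functions import broadcast
>
> small_df.cache()
> result = big_df.join(broadcast(small_df), "account")  # assumed join key
> result.count()  # force the job so the broadcast actually happens
> small_df.unpersist()  # releases the cached blocks, but the broadcast
>                       # copies on the executors do not seem to be freed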
>
> Any idea what I should do?
>
> Other parameters that I specified:
> [image: image.png]
>
>
>
>
> On Mon, 12 Apr 2021 at 10:02, Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
>> Hi,
>>
>> looks like you have answered some questions which I generally ask.
>> Another thing, can you please let me know the environment? Is it AWS, GCP,
>> Azure, Databricks, HDP, etc?
>>
>> Regards,
>> Gourav
>>
>> On Sun, Apr 11, 2021 at 8:39 AM András Kolbert <kolbertandras@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Sure!
>>>
>>> Application:
>>> - Spark version 2.4
>>> - Kafka stream (DStream, from Kafka 0.8 brokers)
>>> - 7 executors, 2 cores and 3700M memory each
>>>
>>> Logic:
>>> - The process initialises a dataframe that contains metrics per
>>> account/product pair (e.g. {"account": "A", "product": "X123", "metric": 51})
>>> - After initialisation, the dataframe is persisted on HDFS (the dataframe
>>> is around 1GB total size in memory)
>>> - Streaming:
>>> - each batch processes the incoming data, unions the main dataframe with
>>> the new account/product/metric interaction dataframe, aggregates the
>>> totals, and then persists on HDFS again (each batch we save the total
>>> dataframe again; see the sketch below)
>>> - The screenshot I sent earlier was taken after this aggregation, and
>>> shows how all the data seems to end up on the same executor. That could
>>> explain why the executor periodically dies with OOM.
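>>>
>>> A rough sketch of the per-batch step (names and the path are
>>> illustrative, not the actual code):
>>>
>>> from pyspark.sql.functions import sum as _sum
>>>
>>> TOTALS_PATH = "hdfs:///path/to/totals"  # placeholder path
>>>
>>> def process_batch(totals_df, batch_df):
>>>     # union the running totals with the new account/product/metric rows,
>>>     # re-aggregate, and persist the new totals back to HDFS
>>>     new_totals = (totals_df.union(batch_df)
>>>                   .groupBy("account", "product")
>>>                   .agg(_sum("metric").alias("metric")))
>>>     new_totals.write.mode("overwrite").parquet(TOTALS_PATH)
>>>     return new_totals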
>>>
>>> Mich, I hope this provides extra information :)
>>>
>>> Thanks
>>> Andras
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, 10 Apr 2021 at 16:42, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can you provide a bit more info please?
>>>>
>>>> How are you running this job, and what is the streaming
>>>> framework (Kafka, files, etc.)?
>>>>
>>>> HTH
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, 10 Apr 2021 at 14:28, András Kolbert <kolbertandras@gmail.com>
>>>> wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> I have a streaming job, and quite often executors die during processing
>>>>> (due to memory errors / "unable to find location for shuffle" etc.). I
>>>>> started digging and found that some of the tasks are concentrated on one
>>>>> executor, just as below:
>>>>> [image: image.png]
>>>>>
>>>>> Can this be the reason?
>>>>> Should I repartition the underlying data before I execute a groupBy on
>>>>> top of it?
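>>>>>
>>>>> Something like this is what I have in mind (the column names are just
>>>>> an example):
>>>>>
>>>>> df = df.repartition(200, "account", "product")
>>>>> agg = df.groupBy("account", "product").count()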
>>>>>
>>>>> Any advice is welcome
>>>>>
>>>>> Thanks
>>>>> Andras
>>>>>
>>>>
