spark-issues mailing list archives

From "Jan Filipiak (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
Date Thu, 19 Dec 2019 04:27:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999730#comment-16999730
] 

Jan Filipiak edited comment on SPARK-30246 at 12/19/19 4:26 AM:
----------------------------------------------------------------

Hello, we are facing similar issues at the moment,

hence I am also looking into this. The cleanup logic seems legitimate. Could you list
all the incoming references to StreamState::associatedChannel from your dump?

I think it's either because there is no read timeout on the network channel (associatedChannel
would then have incoming references from Netty), or because the connectionInactive handler isn't called
on read timeouts (that would be a bug in the code rather than in the config, and the associatedChannel
would have no incoming references from Netty).
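To make the cleanup logic concrete, here is a minimal, self-contained sketch (not Spark's actual classes; a String stands in for the Netty Channel) of how a stream manager in the style of OneForOneStreamManager keeps a StreamState per stream with a reference to its channel, and why entries leak if the channel-inactive callback never fires:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of per-channel stream tracking. In the real shuffle
// service, connectionTerminated is driven by Netty's channelInactive;
// if that callback never fires (e.g. no read timeout is configured for
// a dead connection), the StreamState objects are never released.
public class StreamLeakSketch {
    // Stands in for Spark's StreamState, which holds a reference
    // named associatedChannel to the Netty channel.
    static final class StreamState {
        final String associatedChannel;
        StreamState(String channel) { this.associatedChannel = channel; }
    }

    private final Map<Long, StreamState> streams = new HashMap<>();
    private final AtomicLong nextStreamId = new AtomicLong();

    // Register a stream bound to the given channel; returns its id.
    long registerStream(String channel) {
        long id = nextStreamId.incrementAndGet();
        streams.put(id, new StreamState(channel));
        return id;
    }

    // Cleanup path: drop every stream whose associatedChannel matches
    // the closed channel. If this is never invoked for a dead channel,
    // its StreamStates (and whatever buffers they pin) stay reachable.
    void connectionTerminated(String channel) {
        streams.values().removeIf(s -> s.associatedChannel.equals(channel));
    }

    int liveStreams() { return streams.size(); }

    public static void main(String[] args) {
        StreamLeakSketch mgr = new StreamLeakSketch();
        mgr.registerStream("channel-A");
        mgr.registerStream("channel-A");
        mgr.registerStream("channel-B");
        System.out.println(mgr.liveStreams()); // 3 streams registered
        mgr.connectionTerminated("channel-A"); // channelInactive fired for A
        System.out.println(mgr.liveStreams()); // 1: only channel-B's stream left
        // If channel-B now dies without channelInactive firing,
        // its StreamState leaks, which matches the heap-dump picture.
    }
}
```

This is why the incoming references matter: if the leaked StreamStates' channels are still referenced from Netty's internals, the connection was never torn down; if not, the teardown happened but the cleanup hook was skipped.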

 

 


> Spark on Yarn External Shuffle Service Memory Leak
> --------------------------------------------------
>
>                 Key: SPARK-30246
>                 URL: https://issues.apache.org/jira/browse/SPARK-30246
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.4.3
>         Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>            Reporter: huangweiyi
>            Priority: Major
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle service as part of
> the YARN NM aux services, we encountered OOMs in some NMs.
> After I dumped the heap memory, I found some StreamState objects still on the heap, even
> though the apps those StreamStates belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two parts:
> *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)*
> *(2) PoolChunk (1,059,201,712 (18.44%) bytes)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
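An aside on the read-timeout angle discussed in the comment: Spark's transport layer exposes per-module idle/connection timeouts, and for the shuffle module the keys below are the ones I believe apply (values are illustrative only, and whether the YARN aux-service deployment actually picks them up depends on how the NM is configured):

```properties
# Illustrative values; defaults vary by Spark version.
# Idle/connection timeout for the shuffle transport. When it fires,
# Netty closes the channel, which should trigger channelInactive and
# let the stream manager release the associated StreamState.
spark.shuffle.io.connectionTimeout=120s
# Fallback used when the per-module key above is unset.
spark.network.timeout=120s
```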



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

