spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
Date Sun, 03 Jun 2018 04:10:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned SPARK-24356:
---------------------------------

    Assignee: Misha Dmitriev

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> ------------------------------------------------------------------
>
>                 Key: SPARK-24356
>                 URL: https://issues.apache.org/jira/browse/SPARK-24356
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of a YARN NodeManager that was suffering from high GC
pressure due to high object churn. The analysis was done with the JXRay tool ([www.jxray.com|http://www.jxray.com/]),
which checks a heap dump for a number of well-known memory issues. One problem it found
in this dump is that 19.5% of memory is wasted on duplicate strings. More than half of these
duplicates come from {{FileInputStream.path}} and {{File.path}}. All the {{FileInputStream}}
objects that JXRay shows are garbage; it looks like they are used for a very short period and
then discarded (whether that is a good pattern is a separate question). The {{File}}
instances, however, are traceable to the {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}}
field. Here is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> The values of these {{File.path}} and {{FileInputStream.path}} strings look very similar, so
I think the {{FileInputStream}}s are created by the {{FileSegmentManagedBuffer}} code. The instances
of {{File}}, in turn, likely come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}} in this case, I suggest that the
above code create a {{File}} from a complete, normalized pathname that has already been interned.
This prevents the code inside {{java.io.File}} from modifying the string, so it
uses the interned copy and passes it on to {{FileInputStream}}. Essentially, the current line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
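The proposal above can be sketched as a small self-contained Java class. This is an illustration only, not the patch attached to the issue. The class name `InternedPathSketch`, the method `internedPathname`, and the sample directory and file names are hypothetical. Because `java.io.FileSystem` is package-private and cannot be called from application code, the sketch normalizes the pathname by hand (collapsing repeated separators and dropping a trailing one), and it assumes Unix-style paths.

```java
import java.io.File;

// Hypothetical helper illustrating the "normalize, then intern" idea.
final class InternedPathSketch {
    private InternedPathSketch() {}

    /**
     * Builds "localDir/XX/filename" (XX = subDirId as two hex digits),
     * collapses runs of separators, drops one trailing separator,
     * and returns the interned String instance.
     * Assumption: Unix-style paths; Windows UNC prefixes are not handled.
     */
    static String internedPathname(String localDir, int subDirId, String filename) {
        char sep = File.separatorChar;
        String raw = localDir + sep + String.format("%02x", subDirId) + sep + filename;

        StringBuilder sb = new StringBuilder(raw.length());
        char prev = 0;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == sep && prev == sep) {
                continue; // skip a repeated separator
            }
            sb.append(c);
            prev = c;
        }
        // Drop a single trailing separator, but keep a bare root such as "/".
        int len = sb.length();
        if (len > 1 && sb.charAt(len - 1) == sep) {
            sb.setLength(len - 1);
        }
        return sb.toString().intern();
    }

    public static void main(String[] args) {
        // Hypothetical directory and file names, for demonstration only.
        String localDir = File.separator + "tmp" + File.separator + "blockmgr-demo";
        String name = "shuffle_0_0_0.data";

        // Nested constructors, as in the current line quoted in the issue.
        File nested1 = new File(new File(localDir, String.format("%02x", 10)), name);
        File nested2 = new File(new File(localDir, String.format("%02x", 10)), name);
        System.out.println("nested paths equal: "
            + nested1.getPath().equals(nested2.getPath()));
        System.out.println("nested paths share one String: "
            + (nested1.getPath() == nested2.getPath()));

        // Interned pathnames: repeated calls return the same String instance.
        String p1 = internedPathname(localDir, 10, name);
        String p2 = internedPathname(localDir, 10, name);
        System.out.println("interned pathnames share one String: " + (p1 == p2));
        System.out.println("File keeps the interned String: "
            + (new File(p1).getPath() == p1));
    }
}
```

Whether `File` keeps the very same `String` instance depends on the JDK's internal normalization returning an already-normalized argument unchanged, which is what the description above relies on. The last line printed by `main` lets you check this on a given JDK.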



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


