spark-issues mailing list archives

From "andrzej.stankevich@gmail.com (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-25062) Clean up BlockLocations in FileStatus objects
Date Thu, 09 Aug 2018 01:41:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

andrzej.stankevich@gmail.com updated SPARK-25062:
-------------------------------------------------
    Description: 
When Spark lists a collection of files, it either performs the listing on the driver or launches tasks to list the files in parallel, depending on the number of files: [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]

If Spark creates tasks to list files, each task creates one FileStatus object per file. Before sending a FileStatus to the driver, Spark converts it to a SerializableFileStatus. On the driver side, Spark turns each SerializableFileStatus back into a FileStatus, and it also creates a BlockLocation object for each FileStatus using

 
{code:java}
new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
{code}
 

After this deserialization on the driver side, a BlockLocation is missing much of the information that the original HdfsBlockLocation had.
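The round trip described above can be sketched with simplified stand-in classes. These are not Spark's or Hadoop's actual classes, and the extra field here is illustrative; the point is only that everything beyond the four fields of SerializableFileStatus is dropped on the way to the driver:

```java
import java.util.Map;

// Stand-in for Hadoop's HdfsBlockLocation: carries extra information beyond
// the four fields that Spark keeps (the extraInfo field is hypothetical).
class FullBlockLocation {
    final String[] names;
    final String[] hosts;
    final long offset;
    final long length;
    final Map<String, String> extraInfo; // e.g. topology paths, storage IDs

    FullBlockLocation(String[] names, String[] hosts, long offset, long length,
                      Map<String, String> extraInfo) {
        this.names = names;
        this.hosts = hosts;
        this.offset = offset;
        this.length = length;
        this.extraInfo = extraInfo;
    }
}

// Stand-in for Spark's SerializableBlockLocation: only these four fields
// survive the trip from an executor to the driver.
class SlimBlockLocation {
    final String[] names;
    final String[] hosts;
    final long offset;
    final long length;

    SlimBlockLocation(String[] names, String[] hosts, long offset, long length) {
        this.names = names;
        this.hosts = hosts;
        this.offset = offset;
        this.length = length;
    }

    // Executor side: keep only what is shipped to the driver; the rebuilt
    // location on the driver has no way to recover extraInfo.
    static SlimBlockLocation from(FullBlockLocation loc) {
        return new SlimBlockLocation(loc.names, loc.hosts, loc.offset, loc.length);
    }
}
```

This is why the executor-side path produces smaller FileStatus objects: the lossy conversion acts as an accidental cleanup step.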

 

If Spark does the listing on the driver side, each FileStatus holds HdfsBlockLocation objects, which carry a lot of information that Spark never uses. Because of this, the FileStatus objects take more memory than if they had been created on the executor side.

 

Later, Spark puts all of these objects into the _SharedInMemoryCache_, and that cache takes about 2.2x more memory when the files were listed on the driver side than when they were listed on the executor side.

 

In our case, the cache takes about 125 MB when we scan on executors and about 270 MB when we scan on the driver, for roughly 19K files.
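As a back-of-the-envelope check of these figures (the 125 MB, 270 MB, and 19K numbers are from this report; the per-file sizes are derived estimates, not measurements):

```java
// Small utility to sanity-check the reported cache footprints.
class CacheFootprint {
    // Ratio of driver-side to executor-side cache size.
    static double ratio(double driverMb, double executorMb) {
        return driverMb / executorMb;
    }

    // Approximate cache footprint per listed file, in KB.
    static double perFileKb(double totalMb, int files) {
        return totalMb * 1024.0 / files;
    }
}
```

With the reported numbers, `ratio(270.0, 125.0)` is about 2.16, matching the ~2.2x claim, and the per-file footprint works out to roughly 14.6 KB per FileStatus on the driver path versus roughly 6.7 KB on the executor path.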



> Clean up BlockLocations in FileStatus objects
> ---------------------------------------------
>
>                 Key: SPARK-25062
>                 URL: https://issues.apache.org/jira/browse/SPARK-25062
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.2
>            Reporter: andrzej.stankevich@gmail.com
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

