spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-17513) StreamExecution should discard unneeded metadata
Date Tue, 13 Sep 2016 00:15:21 GMT


Apache Spark reassigned SPARK-17513:

    Assignee: Apache Spark

> StreamExecution should discard unneeded metadata
> ------------------------------------------------
>                 Key: SPARK-17513
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Frederick Reiss
>            Assignee: Apache Spark
> StreamExecution maintains a write-ahead log of batch metadata so that previously in-flight
> batches can be repeated if the driver is restarted. StreamExecution does not garbage-collect
> or compact this log in any way.
> Since the log is implemented with HDFSMetadataLog, these files will consume memory on
> the HDFS NameNode. Specifically, each log file will consume about 300 bytes of NameNode memory
> (150 bytes for the inode and 150 bytes for the block of file contents; see []).
> An application with a 100 msec batch interval will increase the NameNode's heap usage by about
> 250 MB per day.
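
The per-day figure can be checked with a short calculation. This is a rough sketch that assumes one log file per batch and the 300 bytes per file quoted in the description; the function and constant names are illustrative only and are not part of Spark.

```python
# Back-of-the-envelope estimate of NameNode heap growth caused by the
# metadata log. Assumptions taken from the issue description: one log file
# per batch, about 150 bytes for the inode plus 150 bytes for the block.
# All names here are hypothetical and exist only for this illustration.

MS_PER_DAY = 24 * 60 * 60 * 1000
BYTES_PER_LOG_FILE = 150 + 150


def log_files_per_day(batch_interval_ms: int) -> int:
    """Number of batches (and therefore log files) written in one day."""
    return MS_PER_DAY // batch_interval_ms


def namenode_bytes_per_day(batch_interval_ms: int) -> int:
    """Approximate NameNode heap consumed per day by new log files."""
    return log_files_per_day(batch_interval_ms) * BYTES_PER_LOG_FILE


if __name__ == "__main__":
    files = log_files_per_day(100)
    heap = namenode_bytes_per_day(100)
    print(f"{files} files/day, {heap} bytes/day")
```

With a 100 msec interval this gives 864,000 files and 259,200,000 bytes per day, which is consistent with the "about 250 MB" estimate.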
> There is also the matter of recovery. StreamExecution reads its entire log when restarting.
> This read operation will be very expensive if the log contains millions of entries spread
> over millions of files.
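
One way to picture the requested cleanup is a purge step that deletes log files for batches that are no longer needed for recovery. The sketch below is not Spark code and does not use the HDFSMetadataLog API. It assumes, purely for illustration, a local directory with one file per batch named by its numeric batch id; `purge_before` and `min_batch_id_to_keep` are hypothetical names.

```python
from pathlib import Path
from typing import List


def purge_before(log_dir: Path, min_batch_id_to_keep: int) -> List[int]:
    """Delete log files whose batch id is below the given threshold.

    Hypothetical helper: assumes one file per batch, named by its numeric
    batch id. Returns the sorted ids of the batches that were removed.
    """
    removed = []
    for entry in log_dir.iterdir():
        # Skip anything that does not look like a batch file.
        if not entry.is_file() or not entry.name.isdecimal():
            continue
        batch_id = int(entry.name)
        if batch_id < min_batch_id_to_keep:
            entry.unlink()
            removed.append(batch_id)
    return sorted(removed)
```

Keeping only the batches at or above the threshold bounds both the number of files on the NameNode and the amount of log that has to be read on restart.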
