spark-issues mailing list archives

From "Jungtaek Lim (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
Date Wed, 18 Dec 2019 08:36:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998935#comment-16998935 ]

Jungtaek Lim commented on SPARK-30294:
--------------------------------------

Working on the fix. I plan to first propose the solution that opens up room to optimize the
read-only state store, and to fall back to a workaround if the community is not happy with
that approach.

> Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Minor
>
> [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155]
> {code:java}
>     /** Abort all the updates made on this store. This store will not be usable any more. */
>     override def abort(): Unit = {
>       // This if statement is to ensure that files are deleted only if there are changes to the
>       // StateStore. We have two StateStores for each task, one which is used only for reading, and
>       // the other used for read+write. We don't want the read-only to delete state files.
>       if (state == UPDATING) {
>         state = ABORTED
>         cancelDeltaFile(compressedStream, deltaFileStream)
>       } else {
>         state = ABORTED
>       }
>       logInfo(s"Aborted version $newVersion for $this")
>     }
> {code}
> Despite the comment, the read-only state store also goes through the same steps to prepare
> for writing: it creates the temporary file, initializes output streams for the file, closes
> these output streams, and deletes the temporary file. This is unnecessary and confusing,
> because according to the log messages two different instances appear to write to the same
> delta file.
>  
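
One possible way to avoid the create-then-delete cycle described above is to defer creating the temporary file until the first write. The following is only a minimal sketch of that idea under stated assumptions, not the actual fix and not Spark's code: `LazyDeltaStore`, its `put`/`commit`/`abort` methods, the `wroteDelta` accessor, and the `<version>.delta.tmp` naming are all hypothetical, and it writes to the local filesystem instead of HDFS. It also assumes that a committed version should still get a delta file even when nothing was written.

```scala
import java.io.DataOutputStream
import java.nio.file.{Files, Path}

// Hypothetical, simplified stand-in for a state store; NOT Spark's actual class.
object LazyDeltaStore {
  sealed trait State
  case object Updating extends State
  case object Committed extends State
  case object Aborted extends State
}

final class LazyDeltaStore(dir: Path, version: Long) {
  import LazyDeltaStore._

  private var state: State = Updating
  private var deltaFileCreated = false

  // Assumed naming scheme, for illustration only.
  private val tempFile: Path = dir.resolve(s"$version.delta.tmp")
  private val finalFile: Path = dir.resolve(s"$version.delta")

  // The temp file and its stream come into existence only on first access,
  // so a store that is only read from never touches the filesystem.
  private lazy val deltaStream: DataOutputStream = {
    deltaFileCreated = true
    new DataOutputStream(Files.newOutputStream(tempFile))
  }

  def put(key: String, value: String): Unit = {
    require(state == Updating, "store is no longer updatable")
    deltaStream.writeUTF(key)
    deltaStream.writeUTF(value)
  }

  def commit(): Unit = {
    require(state == Updating, "store is no longer updatable")
    // Assumption: every committed version gets a delta file, even an empty one,
    // so accessing the stream here forces creation if nothing was written.
    deltaStream.close()
    Files.move(tempFile, finalFile)
    state = Committed
  }

  def abort(): Unit = {
    // Clean up only if a write actually created the temp file.
    if (state == Updating && deltaFileCreated) {
      deltaStream.close()
      Files.deleteIfExists(tempFile)
    }
    state = Aborted
  }

  // Exposed only so the behavior can be observed in this sketch.
  def wroteDelta: Boolean = deltaFileCreated
}
```

With this shape, a store that is created and then aborted without any `put` never creates or deletes a temporary file, while a store that writes behaves as before.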


