spark-issues mailing list archives

From "Tathagata Das (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1312) Batch should read based on the batch interval provided in the StreamingContext
Date Wed, 24 Dec 2014 21:28:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258536#comment-14258536 ]

Tathagata Das commented on SPARK-1312:
--------------------------------------

This has probably been solved in Spark 1.2.0 with changes in how blocks are assigned to batches.

> Batch should read based on the batch interval provided in the StreamingContext
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1312
>                 URL: https://issues.apache.org/jira/browse/SPARK-1312
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 0.9.0
>            Reporter: Sanjay Awatramani
>            Assignee: Tathagata Das
>            Priority: Critical
>              Labels: sliding, streaming, window
>
> This problem primarily affects sliding window operations in Spark Streaming.
> Consider the following scenario:
> - A DStream is created from any source (I've checked with both file and socket sources).
> - No actions are applied to this DStream
> - A sliding window operation is applied to this DStream, and an action is applied to the windowed stream.
> In this case, Spark will not even read the input stream in batches that do not fall on a multiple of the slide interval. Put another way, it won't read the input when it doesn't have to apply the window function. This is happening because all transformations in Spark are lazy.
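> For example, with a 1-minute batch interval and a 2-minute slide interval, the windowed action fires only on every second batch; the batches in between have no action to force evaluation, so their input is not read when it arrives.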
> How to fix or work around this (see the third line of the snippet):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000)); // 1-minute batch interval
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround: an eager output action forces every batch to be read
> JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen * 60 * 1000), new Duration(slideInt * 60 * 1000)); // window and slide lengths in minutes
> objWindow.dstream().saveAsTextFiles("/Output", "");
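> An alternative workaround (a sketch only, assuming the Function-based foreachRDD overload of the Java API) is to force each batch with a no-op action instead of printing:
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
>
> inputStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>     @Override
>     public Void call(JavaRDD<String> rdd) {
>         rdd.count(); // any RDD action forces the batch's input to be read
>         return null;
>     }
> });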
> The "Window operations" example on the streaming guide implies that Spark will read the
stream in every batch, which is not happening because of the lazy transformations.
> In most cases where a sliding window is used, no action is taken on the pre-window batches, so my gut feeling was that Streaming would read every batch as long as some action is taken on the windowed stream.
> As per Tathagata,
> "Ideally every batch should read based on the batch interval provided in the StreamingContext."
> Refer to the original thread at http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html for more details, including Tathagata's conclusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


