spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shixiong Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19036) Merging dealyed micro batches
Date Mon, 02 Jan 2017 20:05:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shixiong Zhu updated SPARK-19036:
---------------------------------
    Component/s:     (was: Structured Streaming)
                 DStreams

> Merging dealyed micro batches
> -----------------------------
>
>                 Key: SPARK-19036
>                 URL: https://issues.apache.org/jira/browse/SPARK-19036
>             Project: Spark
>          Issue Type: New Feature
>          Components: DStreams
>            Reporter: Sungho Ham
>            Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly distributed by both
time and message. Theses skews make streaming batch delayed despite sufficient resources.

> Merging small-sized delayed batches could help increase efficiency of micro batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |            1   <--  current time |
> |3 |            1   |
> |2 |             1  |
> |1 |            1000      <-- processing |
> After long-running batch (t=1),  three batches has only one message. These batches cannot
utilize parallelism of Spark. By merging stream RDDs from t=1 to t=4 into one RDD, Spark can
process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing parallelism
matters. Suppose batches from t=2 to t=4 consist of one BIG message and ten small messages.
Then, merging three RDDs still could considerably improve efficiency.
> ||batch_time  ||  messages ||           
> | 4 |                  1+10       <--  current time |
> | 3 |                  1+10           |
> | 2 |                  1+10            |
> | 1 |                  1000      <-- processing  |
> There could be two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message