spark-issues mailing list archives

From 宿荣全 (JIRA) <>
Subject [jira] [Commented] (SPARK-4734) [Streaming]limit the file Dstream size for each batch
Date Fri, 05 Dec 2014 09:29:12 GMT


宿荣全 commented on SPARK-4734:

I am sorry that I did not describe the suggestion clearly.
This suggestion is about capping the amount of data processed in each batch.
A cluster's capacity to process input data per batch duration is roughly fixed, but the
amount of data arriving per batch duration is not: sometimes it is large, sometimes small,
and sometimes there is no input at all. If we cap the data per batch, ideally at a value
close to the cluster's processing capacity per batch duration, then a large burst of input
is spread over the next several batch durations. The application may then still be
processing earlier input during durations in which effectively no new data arrives.
for example:
-- batch duration: 2 s
-- cluster's processing capacity per batch duration: 50 MB
-- HDFS (output files of another system):
   -- 8:00 ~ 22:00: about 100 MB of new files per batch duration.
   -- 22:00 ~ 8:00 the next day: about 3 MB of new files per batch duration.
-- current streaming:
   the processing time far exceeds the batch duration, the scheduling delay grows to
several hours, and the application is eventually terminated by the operating system.
-- with this patch:
   the application keeps processing 50 MB per batch duration during 8:00 ~ 22:00 and
postpones the surplus input; during 22:00 ~ 8:00 it works through the backlog in addition
to the ~3 MB of new data.
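The behavior in the example above can be sketched with a small simulation. This is a hypothetical illustration, not code from the patch; the cap value, the arrival pattern, and all names are made up to mirror the numbers in the example.

```java
// Hypothetical simulation of the proposed per-batch size cap (all names illustrative).
public class BatchCapSimulation {
    public static void main(String[] args) {
        final long capMb = 50;  // assumed cluster capacity per batch duration
        long backlogMb = 0;
        // Four "daytime" batches (100 MB arriving) followed by four "nighttime" batches (3 MB).
        long[] arrivalsMb = {100, 100, 100, 100, 3, 3, 3, 3};
        for (long arrivedMb : arrivalsMb) {
            backlogMb += arrivedMb;                         // new files queue up
            long processedMb = Math.min(backlogMb, capMb);  // never process more than the cap
            backlogMb -= processedMb;                       // surplus is postponed to later batches
            System.out.println("arrived=" + arrivedMb
                + " processed=" + processedMb + " backlog=" + backlogMb);
        }
    }
}
```

Each batch processes at most the cap, so the surplus accumulates during the heavy period and is drained during the quiet one, instead of a single batch trying to process everything at once.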

> [Streaming]limit the file Dstream size for each batch
> -----------------------------------------------------
>                 Key: SPARK-4734
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
> Streaming scans new files from HDFS and processes those files in each batch. The current
streaming implementation has some problems:
> 1. When the number of new files (and their total size) is very large in some batch, the
processing time becomes very long and may exceed the slide duration, which delays the
dispatch of the next batch.
> 2. When the total size of the file DStream in one batch is very large, memory usage grows
several times over during the shuffle, and the application becomes slow or is even terminated
by the operating system.
> So if we set an upper limit on the input data for each batch to control the batch processing
time, the job dispatch delay and the processing delay will be alleviated.
> modification:
> Add a new parameter "spark.streaming.segmentSizeThreshold" to InputDStream (the input data
base class). The size of each batch's segment is set via this parameter, either in [spark-defaults.conf]
or in the source.
> All subclasses of InputDStream take the corresponding action based on segmentSizeThreshold.
> This patch modifies FileInputDStream: when new files are found, their names and sizes are
put into a queue, and elements are taken from the queue and packaged into a batch whose total
size is below segmentSizeThreshold. Please see the source for the detailed logic.
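The queue-and-package idea described above might look roughly like the following sketch. This is not the patch's code; the class and method names are invented for illustration, and the actual FileInputDStream change is in Scala inside Spark.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the proposed batching: newly found files are queued with their
// sizes, and each batch takes files until the running total would exceed the threshold
// ("spark.streaming.segmentSizeThreshold" in the proposal). All names are illustrative.
public class FileBatcher {
    static final class FileMeta {
        final String name;
        final long sizeBytes;
        FileMeta(String name, long sizeBytes) { this.name = name; this.sizeBytes = sizeBytes; }
    }

    private final Queue<FileMeta> pending = new ArrayDeque<>();
    private final long thresholdBytes;

    FileBatcher(long thresholdBytes) { this.thresholdBytes = thresholdBytes; }

    // Called when a scan finds a new file.
    void addNewFile(String name, long sizeBytes) {
        pending.add(new FileMeta(name, sizeBytes));
    }

    // Dequeue files for one batch, keeping the total size at or under the threshold.
    // A single oversized file is still taken alone so the queue cannot stall.
    List<FileMeta> nextBatch() {
        List<FileMeta> batch = new ArrayList<>();
        long total = 0;
        while (!pending.isEmpty()) {
            FileMeta head = pending.peek();
            if (!batch.isEmpty() && total + head.sizeBytes > thresholdBytes) break;
            batch.add(pending.poll());
            total += head.sizeBytes;
        }
        return batch;
    }
}
```

Files left in the queue are simply picked up by a later batch, which is how the "postpone the surplus input" behavior falls out of the design.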

This message was sent by Atlassian JIRA

