spark-issues mailing list archives

From 宿荣全 (JIRA) <j...@apache.org>
Subject [jira] [Created] (SPARK-4734) limit the file Dstream size for each batch
Date Thu, 04 Dec 2014 03:06:12 GMT
宿荣全 created SPARK-4734:
--------------------------

             Summary: limit the file Dstream size for each batch
                 Key: SPARK-4734
                 URL: https://issues.apache.org/jira/browse/SPARK-4734
             Project: Spark
          Issue Type: New Feature
          Components: Streaming
            Reporter: 宿荣全
            Priority: Minor


Spark Streaming scans HDFS for new files and processes those files in each batch. The current
implementation has some problems:
1. When the number of new files in a batch interval is very large (that is, the total size of
those files is very large), the processing time becomes very long and may exceed the
slideDuration. As a result, dispatch of the next batch is delayed.
2. When the total size of the file DStream in one batch is very large, memory occupation grows
to n times the input size once the data is shuffled, and the app becomes slow or is even
terminated by the operating system.
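
To illustrate the first problem, here is a minimal sketch of how one oversized batch delays the
batches behind it. It is not Spark code: the function name and the numbers are hypothetical, and
it assumes batches are processed strictly one at a time.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before it can start processing.

    Assumption (not from the issue): batches run strictly one after another,
    and batch i becomes ready at the end of its interval, (i + 1) * batch_interval.
    """
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, free_at)  # wait if the previous batch is still running
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


# Hypothetical numbers: 10-second interval, second batch takes 25 seconds.
print(scheduling_delays(10, [4, 25, 4, 4]))  # prints [0, 0, 15, 9]
```

The single 25-second batch delays the next two batches even though each of them needs only
4 seconds.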

So if we set an upper limit on the input data size for each batch to control the batch
processing time, the job dispatch delay and the processing delay will be alleviated.

Modification:
Add a new parameter "spark.streaming.segmentSizeThreshold" to InputDStream (the base class for
input data). The size of each batch segment is set through this parameter, either in
[spark-defaults.conf.template] or in the application source.
All implementing classes of InputDStream will take the corresponding action with respect to
segmentSizeThreshold.
This patch modifies FileInputDStream: when new files are found, their names and sizes are put
in a queue, and elements are taken from the queue and packed into one batch whose total size
is < segmentSizeThreshold. Please see the source for the detailed logic.
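
The queueing step described above can be sketched roughly as follows. This is an illustration in
Python, not the patch itself: take_batch and the sample file sizes are hypothetical, and emitting
a single file that is at or above the threshold on its own is an assumption made here so that
the queue cannot stall.

```python
from collections import deque


def take_batch(pending, threshold):
    """Pop (name, size) pairs from the front of `pending` for one batch.

    Files are taken in arrival order while the running total stays
    strictly below `threshold`. Returns (file_names, total_size).
    """
    batch, total = [], 0
    while pending:
        name, size = pending[0]
        # Assumption: a file that alone reaches the threshold is still
        # emitted by itself; otherwise it would block the queue forever.
        if batch and total + size >= threshold:
            break
        pending.popleft()
        batch.append(name)
        total += size
    return batch, total


# Hypothetical newly found files, as (name, size) pairs.
pending = deque([("a", 40), ("b", 50), ("c", 30), ("d", 200), ("e", 10)])
while pending:
    print(take_batch(pending, threshold=100))
```

Files that do not fit stay in the queue and are picked up by later batches, so no input is
dropped; it is only spread over more batch intervals.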



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


