spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: fileStream with old files
Date Wed, 15 Jul 2015 04:01:01 GMT
It was added, but it's not documented publicly. I am planning to change the
name of the conf to spark.streaming.fileStream.minRememberDuration to make
it easier to understand.
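
To make the effect of the setting concrete, here is a minimal sketch of how a
remember window could gate which files a directory stream picks up. It is a
simplified model under stated assumptions, not Spark's actual implementation:
the class RememberWindowSketch, its method names, and the timestamps are all
hypothetical, and the rounding to whole batch intervals is an assumption about
how such a window might be applied.

```java
import java.time.Duration;

/** Hypothetical, simplified model of a remember window for a directory stream. */
final class RememberWindowSketch {

    /** Round the minimum remember duration up to a whole number of batch intervals (assumed behaviour). */
    static Duration rememberDuration(Duration minRemember, Duration batchInterval) {
        long batchMs = batchInterval.toMillis();
        long batches = (minRemember.toMillis() + batchMs - 1) / batchMs; // ceiling division
        return Duration.ofMillis(batches * batchMs);
    }

    /** Files last modified at or before the returned timestamp are ignored. */
    static long ignoreThreshold(long nowMs, long streamStartMs, boolean newFilesOnly, Duration remember) {
        // With newFilesOnly = false the start time no longer bounds the window,
        // but the remember duration still does.
        long initial = newFilesOnly ? streamStartMs : 0L;
        return Math.max(initial, nowMs - remember.toMillis());
    }

    /** A file is picked up only if it was modified after the threshold. */
    static boolean isSelected(long fileModTimeMs, long thresholdMs) {
        return fileModTimeMs > thresholdMs;
    }

    public static void main(String[] args) {
        long now = 10_000_000L;                                   // arbitrary "current" time in ms
        long fiveMinutesAgo = now - Duration.ofMinutes(5).toMillis();
        Duration batch = Duration.ofSeconds(10);

        Duration defaultWindow = rememberDuration(Duration.ofSeconds(60), batch);
        Duration widenedWindow = rememberDuration(Duration.ofHours(1), batch);

        // 60-second window: the five-minute-old file falls outside it.
        System.out.println(isSelected(fiveMinutesAgo, ignoreThreshold(now, now, false, defaultWindow)));
        // One-hour window: the same file is now inside it.
        System.out.println(isSelected(fiveMinutesAgo, ignoreThreshold(now, now, false, widenedWindow)));
    }
}
```

In this model, setting newFilesOnly to false is not enough on its own: a file
older than the window is still skipped until the window is widened. Separately,
as far as I know, SparkContext.getConf() returns a copy of the configuration,
so a value set on that copy after the context exists would likely not reach
the file stream; setting it on the SparkConf before creating the context is
the safer assumption.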

On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole <hujie.eagle@gmail.com> wrote:

> A new configuration named *spark.streaming.minRememberDuration* has been
> available since 1.2.1 to control the file stream input. The default value
> is *60 seconds*; you can set it to a larger value to include older files
> (older than 1 minute).
>
> You can get the detail from this jira:
> https://issues.apache.org/jira/browse/SPARK-3276
>
> -Terry
>
> On Tue, Jul 14, 2015 at 4:44 AM, automaticgiant <hunter.morgan@rackspace.com> wrote:
>
>> It's not as odd as it sounds. I want to ensure that, after a long outage,
>> a streaming job can recover all the files that went into a directory while
>> the job was down.
>> I've looked at
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-DStream-by-existing-textfiles-td20030.html#a20039
>> and
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-td14306.html#a16435
>> and
>>
>> https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469#29036469?newreg=e7e25469132d4fbc8350be8f876cf81e
>> , but all seem unhelpful.
>> I've tested combinations of the following:
>>  * fileStreams created with dumb accept-all filters,
>>  * newFilesOnly true and false,
>>  * tweaking minRememberDuration to high and low values,
>>  * on HDFS or a local directory.
>> The problem is that it will not read files in the directory from more
>> than a minute ago.
>> JavaPairInputDStream<LongWritable, Text> input = context.fileStream(indir,
>> LongWritable.class, Text.class, TextInputFormat.class, v -> true, false);
>> I also tried setting the following, with both large and small values:
>>
>> context.sparkContext().getConf().set("spark.streaming.minRememberDuration", "1654564");
>>
>> Are there known limitations of newFilesOnly=false? Am I doing something
>> wrong?
