spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Does Spark Streaming need to list all the files in a directory?
Date Sun, 02 Aug 2015 08:03:26 GMT
I guess it goes through all 500k files the first time
<https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193>
and then uses a filter from the next time onward.
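
A rough sketch of that "list once, then filter" idea in Python, purely as an
illustration. The names here (NewFileFinder, remember_seconds, find_new_files)
are made up for this example and are not Spark's API; the linked
FileInputDStream source is the authority on what Spark really does. The
sketch assumes new files are picked by modification time and that recently
selected files are remembered so they are not returned twice.

```python
import os


class NewFileFinder:
    """Toy model of 'list the directory, then filter' (hypothetical, not Spark code)."""

    def __init__(self, directory, remember_seconds=60.0):
        self.directory = directory
        self.remember_seconds = remember_seconds
        # path -> modification time of files already handed out
        self.seen = {}

    def find_new_files(self, now):
        """Return files modified within the remember window and not seen before."""
        threshold = now - self.remember_seconds
        new_files = []
        # The directory is still scanned on each call; the filter only
        # decides which entries are worth returning.
        with os.scandir(self.directory) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue
                mtime = entry.stat().st_mtime
                if mtime < threshold:
                    continue  # too old: ignored without being tracked
                if entry.path in self.seen:
                    continue  # already selected in an earlier call
                self.seen[entry.path] = mtime
                new_files.append(entry.path)
        # Forget tracked files that have fallen out of the window.
        self.seen = {p: t for p, t in self.seen.items() if t >= threshold}
        return sorted(new_files)
```

In this toy version a directory with 500k old files is still walked on every
call, but only files inside the window are tracked, so the bookkeeping stays
small.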

Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das <tdas@databricks.com> wrote:

> The first time, it needs to list them. After that, the list should be
> cached by the file stream implementation (as far as I remember).
>
>
> On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bwwinthehouse@gmail.com>
> wrote:
>
>> Is this a known bottleneck for Spark Streaming textFileStream? Does it
>> need to list all the current files in a directory before it gets the new
>> files? Say I have 500k files in a directory: does it list them all in order
>> to get the new files?
>>
>
>
