spark-user mailing list archives

From spr <...@yarcdata.com>
Subject Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?
Date Tue, 04 Nov 2014 19:18:04 GMT
Holden Karau wrote
> This is the expected behavior. Spark Streaming only reads new files once,
> this is why they must be created through an atomic move so that Spark
> doesn't accidentally read a partially written file. I'd recommend looking
> at "Basic Sources" in the Spark Streaming guide (
> http://spark.apache.org/docs/latest/streaming-programming-guide.html ).

Thanks for the quick response.

OK, this does seem consistent with the rest of Spark Streaming. Looking at
Basic Sources, it says
"Once moved [into the directory being observed], the files must not be
changed." I don't think of removing a file and creating a new one under the
same name as "changing" the file; the new file has a different inode number,
after all. It might be more precise to say something like "Once a filename
has been detected by Spark Streaming, it will be viewed as having been
processed for the life of the context." This edge case also implies that any
filename-generating code has to be certain it will not create repeats within
the life of a context, which is not easily deduced from the existing
description.
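
For anyone hitting the same thing, here is a minimal sketch of how a producer
might satisfy both constraints (atomic move, no repeated filenames). It assumes
a local POSIX filesystem with the staging and watched directories on the same
volume; publish_file, staging_dir, watched_dir and the "batch" prefix are
hypothetical names of mine, not anything from Spark. On HDFS the equivalent
would be a rename through the Hadoop filesystem API, which is not shown here.

```python
import os
import tempfile
import uuid


def publish_file(data: bytes, watched_dir: str, staging_dir: str,
                 prefix: str = "batch") -> str:
    """Write data to a staging file, then move it into watched_dir
    under a name that has not been used before. Returns the final path."""
    os.makedirs(watched_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # A random UUID in the name makes a repeat within the life of a
    # streaming context practically impossible (assumption: that is
    # unique enough for the use case).
    final_name = f"{prefix}-{uuid.uuid4().hex}.txt"
    final_path = os.path.join(watched_dir, final_name)

    # Write the complete contents outside the watched directory first,
    # so a partially written file is never visible to the reader.
    fd, staging_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # os.rename is atomic on POSIX when source and destination are on
    # the same filesystem.
    os.rename(staging_path, final_path)
    return final_path
```

Calling publish_file twice with the same prefix yields two distinct files in
the watched directory, so a "newer version" of some data always arrives under
a filename the context has not seen.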

Thanks again.


