spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From spr <>
Subject Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?
Date Tue, 04 Nov 2014 19:18:04 GMT
Holden Karau wrote
> This is the expected behavior. Spark Streaming only reads new files once,
> this is why they must be created through an atomic move so that Spark
> doesn't accidentally read a partially written file. I'd recommend looking
> at "Basic Sources" in the Spark Streaming guide (
> ).

Thanks for the quick response.

OK, this does seem consistent with the rest of Spark Streaming.  Looking at
Basic Sources, it says
"Once moved [into the directory being observed], the files must not be
changed."  I don't think of removing a file and creating a new one under the
same name as "changing" the file, i.e., it has a different inode number.  It
might be more precise to say something like "Once a filename has been
detected by Spark Streaming, it will be viewed as having been processed for
the life of the context."  This end-case also implies that any
filename-generating code has to be certain it will not create repeats within
the life of a context, which is not easily deduced from the existing

Thanks again.

View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message