spark-user mailing list archives

From Holden Karau
Subject Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?
Date Tue, 04 Nov 2014 18:46:41 GMT
This is the expected behavior. Spark Streaming reads each new file only
once, so a replacement moved in under a name it has already processed is not
picked up. That is also why files must be created through an atomic move:
otherwise Spark could read a partially written file. I'd recommend looking
at "Basic Sources" in the Spark Streaming guide.
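
If it helps, here is a rough sketch of how a data-creation script could do
this on a local POSIX filesystem: write the complete file into a staging
directory, then rename it into the watched directory under a name that has
not been used before. The function name publish_file, its parameters, and
the naming scheme are all hypothetical, and the sketch assumes the staging
and watched directories are on the same filesystem, which is what makes the
rename atomic.

```python
import os
import tempfile
import time
import uuid


def publish_file(text, staging_dir, watched_dir, prefix="input"):
    """Write text to a staging file, then move it into watched_dir.

    Hypothetical helper: names and layout are illustrative assumptions.
    Returns the final path of the published file.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(watched_dir, exist_ok=True)

    # Write the full contents outside the watched directory first, so the
    # stream never sees a half-written file.
    fd, staging_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())

    # Give every version a fresh name instead of reusing one the stream
    # has already processed.
    unique_name = "%s-%d-%s.txt" % (
        prefix, int(time.time() * 1000), uuid.uuid4().hex[:8])
    final_path = os.path.join(watched_dir, unique_name)

    # os.rename is atomic on POSIX when source and destination are on the
    # same filesystem.
    os.rename(staging_path, final_path)
    return final_path
```

Calling publish_file("some input\n", "/tmp/staging", "/tmp/watched") twice
yields two differently named files in the watched directory, each of which
the stream should treat as new. For HDFS the same idea applies, with the
final step done by a move such as `hdfs dfs -mv` rather than os.rename.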

On Tue, Nov 4, 2014 at 10:41 AM, spr wrote:

> I am trying to implement a use case that takes some human input.  Putting
> that in a single file (as opposed to a collection of HDFS files) would be a
> simpler human interface, so I tried an experiment with whether Spark
> Streaming (via textFileStream) will recognize a new version of a filename it
> has already digested.  (Yes, I'm deleting and moving a new file into the
> same name, not modifying in place.)  It appears the answer is No, it does
> not recognize a new version.  Can one of the experts confirm a) this is true
> and b) this is intended?
>
> Experiment:
> - run an existing program that works to digest new files in a directory
> - modify the data-creation script to put the new files always under the same
>   name instead of different names, then run the script
>
> Outcome:  it sees the first file under that name, but none of the subsequent
> files (with different contents, which would show up in output).

