spark-user mailing list archives

From Arush Kharbanda <>
Subject Re: Spark streaming - tracking/deleting processed files
Date Sat, 31 Jan 2015 08:33:02 GMT
Hi Ganterm,

That is expected behavior. Take a look at the documentation for textFileStream:

Create an input stream that monitors a Hadoop-compatible filesystem for new
files and reads them as text files (using key as LongWritable, value as
Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.

In other words, files should be moved into the monitored directory only
while the streaming job is running. You can manage that with a shell
script: stage incoming files elsewhere, move them into the directory one
at a time while the job is up, and move processed ones out with another
script.
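A minimal sketch of that move-based staging pattern; the directory names and file name here are hypothetical examples, not part of this thread:

```shell
#!/bin/sh
# Hypothetical staging and watched directories -- adjust to your setup.
STAGING=/tmp/demo_staging
WATCHED=/tmp/demo_watched
mkdir -p "$STAGING" "$WATCHED"

# 1. Write the file fully in the staging area first, so the streaming
#    job never sees a half-written file.
echo "event-1" > "$STAGING/batch-001.txt"

# 2. Move it into the monitored directory. Within the same filesystem,
#    mv is a rename: the file appears atomically with a fresh timestamp,
#    which is what lets textFileStream detect it as a new file.
mv "$STAGING/batch-001.txt" "$WATCHED/batch-001.txt"

ls "$WATCHED"
```

The same idea covers the backlog case: if files accumulated while the job was down, hold them in the staging directory and only move them into the watched directory after the job is back up, so each one shows up as "new".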

On Fri, Jan 30, 2015 at 11:37 PM, ganterm <> wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we are having is the case where the job is down but files are
> still being added to the directory.
> Once the job starts up again, those files are not being picked up (since
> they are not new or changed while the job is running) but we would like
> them
> to be processed.
> Is there a solution for that? Is there a way to keep track what files have
> been processed and can we "force" older files to be picked up? Is there a
> way to delete the processed files?
> Thanks!
> Markus



*Arush Kharbanda* || Technical Teamlead ||
