spark-user mailing list archives

From Akhil Das <>
Subject Re: Spark streaming - tracking/deleting processed files
Date Sat, 31 Jan 2015 08:21:50 GMT
This might not be a straightforward approach, but one way would be to use
the *PairRDDFunctions*, which give you a few methods to access the
partitions and, from the partitions, the filenames. Once you have the
filename, you can delete the file after your operations. I'm not sure
whether Spark has changed the API since, but you can give it a try.

Here's a snippet:

    UnionPartition upp = (UnionPartition) ds.values().getPartitions()[
    NewHadoopPartition npp = (NewHadoopPartition) upp.split();
    String fPath = npp.serializableHadoopSplit().value().toString();

Here fPath would be the first file's name in the stream. And ds is a
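
To illustrate the "delete it after your operations" step, here is a minimal
sketch in plain Java. It assumes that the string obtained from the split has
the form path:start+length (which is, as far as I know, what Hadoop's
FileSplit.toString() produces) and that the file lives on the local
filesystem. The class ProcessedFiles and its methods pathOf and deleteLocal
are hypothetical helpers made up for this example; they are not part of
Spark or Hadoop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of Spark or Hadoop. */
final class ProcessedFiles {
    private ProcessedFiles() {}

    /**
     * Strips a trailing ":start+length" suffix from a split description
     * such as "file:/data/in/a.txt:0+1234" (assumed format). If no such
     * suffix is present, the input is returned unchanged.
     */
    static String pathOf(String splitDescription) {
        int colon = splitDescription.lastIndexOf(':');
        if (colon < 0) {
            return splitDescription;
        }
        String suffix = splitDescription.substring(colon + 1);
        return suffix.matches("\\d+\\+\\d+")
                ? splitDescription.substring(0, colon)
                : splitDescription;
    }

    /**
     * Deletes the local file named by the split description.
     * Returns true if a file was deleted, false if it did not exist.
     * A "file:" URI must be properly encoded (e.g. no raw spaces).
     */
    static boolean deleteLocal(String splitDescription) {
        String p = pathOf(splitDescription);
        Path path = p.startsWith("file:") ? Paths.get(URI.create(p)) : Paths.get(p);
        try {
            return Files.deleteIfExists(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the snippet above, this would be called as
ProcessedFiles.deleteLocal(fPath) once the batch has been processed. For
files on HDFS, the path returned by pathOf would presumably have to be
deleted through Hadoop's FileSystem API rather than java.nio.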

Best Regards

On Fri, Jan 30, 2015 at 11:37 PM, ganterm <> wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we are having is the case where the job is down but files are
> still being added to the directory.
> Once the job starts up again, those files are not picked up (since
> they are not new or changed while the job is running), but we would like
> them to be processed.
> Is there a solution for that? Is there a way to keep track of which files
> have been processed, and can we "force" older files to be picked up? Is
> there a way to delete the processed files?
> Thanks!
> Markus
