spark-user mailing list archives

From ArtemisDev <>
Subject Structured Streaming using File Source - How to handle live files
Date Sun, 07 Jun 2020 17:41:51 GMT
We are trying to use Structured Streaming with a file source, but are 
having problems getting Spark to read the files properly.  Another 
process generates the data files in the Spark data source directory on 
a continuous basis.  What we have observed is that the moment a data 
file is created, Spark reads it immediately, before the producing 
process has finished writing it and so before the real EOF is reached.  
Spark then never revisits the file, so we end up with empty data 
content.  The only way we have found to make it work is to produce the 
data files in a separate directory (e.g. /tmp) and move them into 
Spark's file source directory after data generation completes.
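
For reference, a minimal Python sketch of the write-then-move workaround 
described above.  This is not our actual producer: the function name 
publish_atomically, the directory layout and the file name are all 
hypothetical.  It assumes the staging directory is on the same 
filesystem as the Spark source directory, because a rename is only 
atomic within one filesystem; a move across mounts (for example from a 
/tmp on a different volume) typically falls back to a copy and could 
expose a partial file again.

```python
import os
import tempfile


def publish_atomically(lines, staging_dir, source_dir, filename):
    """Write the whole file in staging_dir, then rename it into source_dir.

    Hypothetical helper for illustration only. Both directories are
    assumed to live on the same filesystem so the rename is atomic.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(source_dir, exist_ok=True)

    staged_path = os.path.join(staging_dir, filename)
    final_path = os.path.join(source_dir, filename)

    # Write the complete content outside the directory Spark is watching.
    with open(staged_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before publishing

    # Publish in one step: the file appears in source_dir fully written.
    os.replace(staged_path, final_path)
    return final_path


if __name__ == "__main__":
    # Demo with throwaway directories; real paths would differ.
    base = tempfile.mkdtemp()
    path = publish_atomically(
        ["a,1", "b,2"],
        os.path.join(base, "staging"),
        os.path.join(base, "source"),
        "part-0001.csv",
    )
    print(path)
```

With this pattern the watched directory only ever contains complete 
files, which is why the move step works for us where writing in place 
does not.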

My questions:  Is this behavior by design, or is there a way to keep 
the Spark streaming process from importing a file while it is still 
being written by another process?  In other words, do we have to stage 
data files in a tmp dir and move them, or can the data producing 
process and Spark share the same directory?


-- Nick
