spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Structured Streaming using File Source - How to handle live files
Date Sat, 13 Jun 2020 09:45:08 GMT
Hi,

Yeah, we generally read files from HDFS or object stores like S3, GCS, etc.,
where files cannot be updated.

Regards
Gourav

On Sun, 7 Jun 2020, 22:36 Jungtaek Lim, <kabhwan.opensource@gmail.com>
wrote:

> Hi Nick,
>
> I guess that's by design - Spark assumes an input file will not be
> modified once it is placed on the input path. This makes it easy for Spark
> to track which files have been processed and which have not. If input files
> could be modified, Spark would have to enumerate all of the files and track
> how many lines/bytes it has read "per file"; in the worst case it might even
> read an incomplete line (if the writer doesn't guarantee whole-line writes)
> and crash or produce incorrect results.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Jun 8, 2020 at 2:43 AM ArtemisDev <artemis@dtechspace.com> wrote:
>
>> We were trying to use structured streaming from a file source, but had
>> problems getting the files read by Spark properly.  We have another
>> process generating the data files in the Spark data source directory on
>> a continuous basis.  What we observed was that the moment a data
>> file was created, before the data-producing process had finished, it was
>> read by Spark immediately without reaching the EOF.  Spark then never
>> revisited the file, so we ended up with only empty data content.  The
>> only way to make it work is to produce the data files in a separate
>> directory (e.g. /tmp) and move them to Spark's file source dir after
>> the data generation completes.
>>
>> My questions:  Is this behavior by design, or is there any way to
>> prevent the Spark streaming process from importing a file while it is
>> still being used by another process?  In other words, do we have to use
>> the tmp dir to move data files around, or can the data-producing process
>> and Spark share the same directory?
>>
>> Thanks!
>>
>> -- Nick

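For readers who want to try the workaround Nick describes (write the file
somewhere else, then move it into the source directory once it is complete),
here is a minimal producer-side sketch in plain Python. It is not taken from
the thread and is not Spark code. The function name publish_atomically and the
directory and file names are hypothetical. It assumes the staging directory
and the watched directory are on the same filesystem, since that is what lets
os.replace act as a single rename; a staging dir such as /tmp may sit on a
different filesystem, in which case the rename fails and a copy-based move
would no longer be atomic.

```python
import os
import tempfile


def publish_atomically(data: bytes, staging_dir: str, watched_dir: str, name: str) -> str:
    """Write the whole payload in staging_dir, then rename it into watched_dir.

    Hypothetical helper for illustration. Assumes both directories are on
    the same filesystem so that os.replace is a single rename operation.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(watched_dir, exist_ok=True)

    # Create a uniquely named temporary file outside the watched directory,
    # so a reader listing watched_dir never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing

        final_path = os.path.join(watched_dir, name)
        # The file becomes visible in watched_dir only at this point,
        # already complete.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the staging file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A producer would call something like publish_atomically(payload,
"/data/staging", "/data/input", "part-0001.csv"), with /data/input being the
path handed to the streaming file source (both paths are made up here). The
point of the design is that the reader can only ever list finished files, which
matches the assumption Jungtaek describes above: a file is not modified once it
appears on the input path.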