spark-issues mailing list archives

From "Jungtaek Lim (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-30281) 'archive' option in FileStreamSource misses to consider partitioned and recursive option
Date Tue, 17 Dec 2019 05:50:00 GMT
Jungtaek Lim created SPARK-30281:
------------------------------------

             Summary: 'archive' option in FileStreamSource misses to consider partitioned
and recursive option
                 Key: SPARK-30281
                 URL: https://issues.apache.org/jira/browse/SPARK-30281
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


The cleanup option for FileStreamSource was introduced in SPARK-20568.

To simplify the check that validates the archive path, that change relied on the assumption
that FileStreamSource only reads files that meet one of two conditions: 1) the file's parent
directory matches the source pattern, or 2) the file itself matches the source pattern.

During post-hoc review we found other cases that invalidate this assumption: partitioned
sources and the recursive option. With these, FileStreamSource can read arbitrary files in
subdirectories beneath the paths that match the source pattern, so simply checking the depth
of the archive path does not work.
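
The gap can be illustrated with a small standalone sketch. This is not Spark's code: the
helper names, the example paths, and the exact depth rule are all hypothetical. It also
assumes an archived file lands at the archive directory followed by the file's original
path, and it only models the recursive case.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def components(path):
    """Path components without the root, e.g. '/a/b' -> ['a', 'b']."""
    return [p for p in PurePosixPath(path).parts if p != "/"]


def glob_matches(path, pattern):
    """Component-wise glob match, so '*' never crosses a '/'."""
    p, g = components(path), components(pattern)
    return len(p) == len(g) and all(fnmatchcase(a, b) for a, b in zip(p, g))


def is_read_by_source(file_path, pattern, recursive):
    """Hypothetical model of which files the source picks up."""
    path = PurePosixPath(file_path)
    # The two conditions the simplified check relied on.
    if glob_matches(file_path, pattern) or glob_matches(str(path.parent), pattern):
        return True
    # With recursive lookup, any ancestor matching the pattern is enough.
    if recursive:
        return any(glob_matches(str(a), pattern) for a in path.parents)
    return False


def depth_check_passes(archived_file, pattern):
    """Hypothetical depth-only rule: deeper than pattern depth + 1 is 'safe'."""
    return len(components(archived_file)) > len(components(pattern)) + 1


pattern = "/data/*"
# Assumed layout: archive directory + original path of the source file.
archived = "/data/archive/old" + "/data/in/a.txt"

print(depth_check_passes(archived, pattern))                  # depth-only verdict
print(is_read_by_source(archived, pattern, recursive=False))  # under the old assumption
print(is_read_by_source(archived, pattern, recursive=True))   # with recursive lookup
```

In this sketch the archived file is deep enough to pass the depth-only rule, and it is indeed
not re-read under the two original conditions. With recursive lookup, however, its ancestor
/data/archive matches the source pattern, so the archived file would be read again.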

We need to restore the path check logic, though it will not be easy to explain to end users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


