spark-issues mailing list archives

From "Egor Pahomov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
Date Fri, 10 Feb 2017 21:25:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861869#comment-15861869 ]

Egor Pahomov commented on SPARK-19524:
--------------------------------------

[~sowen], probably yes, though I'm not sure. If you read "Should process only new files and
ignore existing files in the directory" literally, then I agree that setting this field to
false does not have to mean that old files are processed. IMHO, everything around this field
is poorly documented or poorly designed. Since spark.streaming.minRememberDuration is not
documented in http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not
feel comfortable changing it. Moreover, it would be strange to change it in order to process
old files when the purpose of that setting is quite different. Nevertheless, I was given an
API with newFilesOnly, about which I made a false but not totally unreasonable assumption
based on all the available documentation. I was wrong, but it still feels like a trap that I
walked into and that could easily have been avoided.
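
A rough Python model of the cutoff that FileInputDStream.scala appears to apply may make
the behaviour concrete. This is a sketch under assumptions, not Spark's code: the function
and parameter names are hypothetical, the 60-second default for the remember duration is
taken from a reading of the source rather than from the configuration page, and other
checks the real class performs are left out.

```python
import math


def mod_time_ignore_threshold(now_ms, start_ms, new_files_only,
                              batch_ms, min_remember_ms=60_000):
    """Modification time (ms) at or below which a file is ignored.

    Hypothetical model; min_remember_ms stands in for the
    minRememberDuration setting, with an assumed default of 60 s.
    """
    # Remember window: a whole number of batches covering min_remember_ms.
    remember_ms = batch_ms * math.ceil(min_remember_ms / batch_ms)
    # newFilesOnly=true pins the floor at stream start; false lowers it to 0.
    initial_threshold = start_ms if new_files_only else 0
    # The remember window applies in both cases, so old files are still cut off.
    return max(initial_threshold, now_ms - remember_ms)


def is_selected(file_mod_ms, **kwargs):
    """True if a file with this modification time would be picked up."""
    return file_mod_ms > mod_time_ignore_threshold(**kwargs)
```

Under this model, with newFilesOnly=false a file modified five minutes before the stream
starts is still skipped because it falls outside the remember window. It is picked up only
if the remember duration is raised to cover it.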

> newFilesOnly does not work according to docs. 
> ----------------------------------------------
>
>                 Key: SPARK-19524
>                 URL: https://issues.apache.org/jira/browse/SPARK-19524
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.2
>            Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working.
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
> says that it shouldn't work as expected.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
> is not at all clear about what the code is trying to do.




