nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1885) Protocol-file should treat symbolic links as redirects
Date Tue, 04 Nov 2014 21:58:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196917#comment-14196917 ]

Hudson commented on NUTCH-1885:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2848 (See [https://builds.apache.org/job/Nutch-trunk/2848/])
NUTCH-1483 (including NUTCH-1879, NUTCH-1880, NUTCH-1885) fix errors related to protocol-file
(snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1636736)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/conf/regex-normalize.xml.template
* /nutch/branches/2.x/src/java/org/apache/nutch/util/URLUtil.java
* /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
* /nutch/branches/2.x/src/test/org/apache/nutch/util/TestURLUtil.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
* /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
* /nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
* /nutch/trunk/src/test/org/apache/nutch/util/TestURLUtil.java


> Protocol-file should treat symbolic links as redirects
> ------------------------------------------------------
>
>                 Key: NUTCH-1885
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1885
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: protocol
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.3, 1.10
>
>         Attachments: NUTCH-1885-2x-v1.patch, NUTCH-1885-trunk-v1.patch
>
>
> (reported by [~angela_wang], see NUTCH-1884, [[1|https://www.mail-archive.com/dev@nutch.apache.org/msg15614.html]]
> and [[2|https://www.mail-archive.com/dev@nutch.apache.org/msg15610.html]])
> If a file is a symbolic link or contains a link on its path, protocol-file follows
> the link immediately and returns a Content object with the canonical path (all symbolic links
> resolved) in the field "Location". This may cause
> - the Parse object not being available under its expected URL (see NUTCH-1884)
> - dubious CrawlDatums (status fetched!) in CrawlDb (the first URL is a symbolic link to the
> second item):
> {noformat}
> file:/var/www/redir_test.html   Version: 7
> Status: 2 (db_fetched)
> ...
> Signature: null
> Metadata: 
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
> file:/var/www/test.html Version: 7
> Status: 2 (db_fetched)
> ...
> Signature: 50fa8436398f0ecb6b15eaba0574ef23
> Metadata: 
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
> {noformat}
> Because the signature is null, these will never result in duplicates in the index.
> Protocol-file should instead explicitly redirect to the link target. This should be the
> default; optionally, we could add a property to restore the old behavior.
> This should not be difficult to resolve: FileResponse already has status "redirect" for symlinks,
> but File.getProtocolOutput() then resolves the links internally. So we just need to return
> a redirect response before links are resolved/followed.
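
The idea described above (report a redirect instead of silently following the link) can be illustrated with a small standalone sketch. This is not the code from the attached patches and it does not use any Nutch classes: the class SymlinkRedirectSketch and the methods hasSymlinkOnPath and redirectTarget are hypothetical names introduced only for illustration. It assumes a file system that supports symbolic links and relies only on java.nio.file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper, not part of Nutch: decides whether a path should be answered with a redirect. */
final class SymlinkRedirectSketch {

  /** Returns true if the file itself or any directory on its (absolute) path is a symbolic link. */
  static boolean hasSymlinkOnPath(Path absolute) {
    Path current = absolute.getRoot();
    for (Path part : absolute) {
      current = current.resolve(part);
      if (Files.isSymbolicLink(current)) {
        return true;
      }
    }
    return false;
  }

  /**
   * Returns the canonical path (all symbolic links resolved) as a redirect target
   * if the requested path involves a symbolic link, otherwise an empty Optional,
   * meaning the file can be read directly under the requested path.
   */
  static Optional<Path> redirectTarget(Path requested) throws IOException {
    Path absolute = requested.toAbsolutePath();
    if (!hasSymlinkOnPath(absolute)) {
      return Optional.empty();
    }
    return Optional.of(absolute.toRealPath());
  }

  public static void main(String[] args) throws IOException {
    // Resolve the temp directory first so that links above it (e.g. a symlinked /tmp) do not interfere.
    Path dir = Files.createTempDirectory("symlink-sketch").toRealPath();
    Path target = Files.createFile(dir.resolve("test.html"));
    Path link = Files.createSymbolicLink(dir.resolve("redir_test.html"), target);

    // The link yields a redirect target, the plain file does not.
    System.out.println(link.getFileName() + " -> " + redirectTarget(link));
    System.out.println(target.getFileName() + " -> " + redirectTarget(target));
  }
}
```

In a fetcher, a non-empty result would correspond to returning a redirect to the resolved location before any content is read, so that only the link target ends up with fetched content and a signature.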



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)