[ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986043#comment-16986043 ]
Hudson commented on NUTCH-2748:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch-trunk #3656 (See [https://builds.apache.org/job/Nutch-trunk/3656/])
NUTCH-2748 Fetch status gone (redirect exceeded) not to overwrite (snagel: [https://github.com/apache/nutch/commit/969a1943939703e524f7e50185dfa03db8bd419b])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (edit) conf/nutch-default.xml
> Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2748
> URL: https://issues.apache.org/jira/browse/NUTCH-2748
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.17
>
> Attachments: test-NUTCH-2748.zip
>
>
> If the fetcher is following redirects and the maximum number of redirects in a redirect
> chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status
> "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update, the
> "gone" item sets the status of existing items (including "db_fetched") to "db_gone".
> It shouldn't, as the final redirect target has not been fetched and nothing is known
> about its status. A wrong db_gone may then cause a page to be deleted from the search index.
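The overwrite described above can be modeled in a few lines of Java. This is a simplified sketch, not the actual CrawlDb update code: the class, enum, and method names are hypothetical, and the update rule is reduced to "the newly fetched status replaces the stored one".

```java
// Simplified model of the problem; all names are hypothetical, not Nutch APIs.
class CrawlDbUpdateSketch {
  enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_GONE }
  enum FetchStatus { FETCH_SUCCESS, FETCH_GONE }

  /** Naive update rule: the status reported by the fetcher replaces the stored one. */
  static DbStatus update(DbStatus existing, FetchStatus fetched) {
    switch (fetched) {
      case FETCH_SUCCESS: return DbStatus.DB_FETCHED;
      case FETCH_GONE:    return DbStatus.DB_GONE;
      default:            return existing;
    }
  }

  public static void main(String[] args) {
    // The redirect target was fetched successfully in an earlier cycle.
    DbStatus target = DbStatus.DB_FETCHED;
    // A redirect chain ending at the target exceeds the limit, so the fetcher
    // emits fetch_gone for it although the target itself was never requested.
    target = update(target, FetchStatus.FETCH_GONE);
    System.out.println(target); // prints DB_GONE
  }
}
```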
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during CrawlDb update
> 2. when http.redirect.max is hit, the fetcher stores nothing or a redirect status instead of a fetch_gone
> Solution 2. seems easier to implement and it would be possible to make the behavior configurable:
> - store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default as it's close to the current behavior without the risk of accidentally setting successful fetches to db_gone
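The three configurable behaviors listed for solution 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Policy` and `Emitted` enums and both methods are hypothetical names, not taken from the actual commit or from FetcherThread.java.

```java
// Sketch of the three options for solution 2; all names are hypothetical.
class RedirectExceededSketch {
  /** What to do when the redirect limit is exceeded. */
  enum Policy { STORE_AS_OUTLINK, STORE_GONE, SKIP }

  /** What the fetcher would write for the redirect target. */
  enum Emitted { LINKED, FETCH_GONE, NOTHING }

  static Emitted onRedirectLimitExceeded(Policy policy) {
    switch (policy) {
      case STORE_AS_OUTLINK: return Emitted.LINKED;     // same as http.redirect.max == 0
      case STORE_GONE:       return Emitted.FETCH_GONE; // behavior described in the report
      case SKIP:
      default:               return Emitted.NOTHING;    // proposed default
    }
  }

  /** Only a fetch_gone record can turn an existing db_fetched item into db_gone. */
  static boolean mayOverwriteFetched(Emitted emitted) {
    return emitted == Emitted.FETCH_GONE;
  }

  public static void main(String[] args) {
    for (Policy p : Policy.values()) {
      Emitted e = onRedirectLimitExceeded(p);
      System.out.println(p + " -> " + e + ", may overwrite db_fetched: " + mayOverwriteFetched(e));
    }
  }
}
```

In this model, only the STORE_GONE option can overwrite a successful fetch, which is the argument given above for making "store nothing" the default.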
--
This message was sent by Atlassian Jira
(v8.3.4#803005)