nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
Date Fri, 18 Mar 2011 12:08:29 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008410#comment-13008410 ]

Julien Nioche commented on NUTCH-963:
-------------------------------------

Re-dedup on the SOLR side: good point, although IIRC the SOLR dedup is based on the signature
only and does not take the score of a doc into account.
The dedup/404 remover would allow doing one or both of these operations, so that people can
deactivate what they don't need.
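
To illustrate the difference between signature-only and score-aware deduplication mentioned above, here is a minimal sketch. It is not the Nutch or SOLR implementation: the `Doc` class, its fields and the `idsToDelete` method are hypothetical names, and the rule "keep the highest-scoring document per signature" is an assumption about what taking the score into account would mean.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScoreAwareDedup {

    /** Hypothetical stand-in for an indexed document; not a Nutch or SOLR class. */
    static final class Doc {
        final String id;
        final String signature;
        final float score;

        Doc(String id, String signature, float score) {
            this.id = id;
            this.signature = signature;
            this.score = score;
        }
    }

    /**
     * Returns the ids of the documents to delete: for each signature, every
     * document except the one with the highest score. On equal scores the
     * document seen first is kept.
     */
    static List<String> idsToDelete(List<Doc> docs) {
        Map<String, Doc> bestBySignature = new LinkedHashMap<>();
        List<String> toDelete = new ArrayList<>();
        for (Doc d : docs) {
            Doc kept = bestBySignature.get(d.signature);
            if (kept == null) {
                // First document seen with this signature.
                bestBySignature.put(d.signature, d);
            } else if (d.score > kept.score) {
                // The new document wins; the previously kept one becomes a duplicate.
                toDelete.add(kept.id);
                bestBySignature.put(d.signature, d);
            } else {
                toDelete.add(d.id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("http://a.example/page", "sigA", 0.2f),
                new Doc("http://b.example/page", "sigA", 0.9f),
                new Doc("http://c.example/page", "sigB", 0.5f));
        System.out.println(idsToDelete(docs));
    }
}
```

A signature-only dedup would have no basis for preferring one of the two `sigA` documents; the sketch uses the score to pick the one to keep.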

We're not likely to have the new deduplication any time soon anyway, so I am definitely OK with
adding the 404 remover in 1.3, provided, as you said, that it has been tested and reviewed.

> Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-963
>                 URL: https://issues.apache.org/jira/browse/NUTCH-963
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 2.0
>            Reporter: Claudio Martella
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java
>
>
> When issuing recrawls it can happen that certain urls have expired (i.e. URLs that don't exist anymore and return 404).
> This patch creates a new command in the indexer that scans the crawldb looking for these urls and issues delete commands to SOLR.
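
The scan-and-delete idea described in the issue can be sketched roughly as follows. This is not the attached patch: the in-memory `Map` stands in for the crawldb, the `Status` enum is a hypothetical substitute for Nutch's `CrawlDatum` status constants such as `STATUS_DB_GONE`, and the sketch assumes the SOLR document id is the URL. It only builds the body of a SOLR XML delete request and does not send it anywhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GoneUrlDeleteSketch {

    /** Hypothetical status values; the real crawldb stores CrawlDatum status codes. */
    enum Status { FETCHED, GONE, UNFETCHED }

    /** Escapes the characters that are unsafe inside XML element text. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Scans the (in-memory, illustrative) crawldb and builds one SOLR XML
     * delete message listing every URL whose status is GONE.
     * Assumption: the SOLR unique key of a document is its URL.
     * If nothing is GONE the result is an empty <delete></delete> element,
     * which a caller would normally skip rather than send.
     */
    static String buildDeleteMessage(Map<String, Status> crawlDb) {
        StringBuilder xml = new StringBuilder("<delete>");
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (entry.getValue() == Status.GONE) {
                xml.append("<id>").append(escapeXml(entry.getKey())).append("</id>");
            }
        }
        return xml.append("</delete>").toString();
    }

    public static void main(String[] args) {
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/still-here", Status.FETCHED);
        crawlDb.put("http://example.com/removed?a=1&b=2", Status.GONE);
        crawlDb.put("http://example.com/not-yet", Status.UNFETCHED);
        System.out.println(buildDeleteMessage(crawlDb));
    }
}
```

In a real job the scan would run over the crawldb on Hadoop and the deletes would be sent to SOLR in batches followed by a commit; both are left out here.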

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
