nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
Date Tue, 20 Sep 2011 12:38:10 GMT


Julien Nioche commented on NUTCH-1052:

I like the original idea and agree that having to read and write the whole crawldb once more
would be a pain for large crawls. This is a good example of what 2.0 could add (or could have
added, if you are pessimistic).

I agree with your suggestion for an alternative to using null as the value, which is to encode
the action (add, delete) either as a complex object in the key or as part of the value. The
latter would make more sense, as it is unlikely that we'd add AND delete the same document
as part of the same batch. Could you include that in your patch?
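
To make the "action as part of the value" option concrete, here is a rough sketch only, not
code from the patch. The class name IndexAction, its fields, and the use of a plain String as
a stand-in for the indexed document are all hypothetical. It uses java.io data streams to
mimic the write/read pattern of a Hadoop Writable without depending on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type (not part of Nutch): carries the action next to
// the document, so a delete no longer has to be signalled by a null value.
final class IndexAction {
    static final byte ADD = 0;
    static final byte DELETE = 1;

    final byte action;
    final String doc; // stand-in for the real document; empty for deletes

    IndexAction(byte action, String doc) {
        if (action != ADD && action != DELETE) {
            throw new IllegalArgumentException("unknown action: " + action);
        }
        this.action = action;
        this.doc = (doc == null) ? "" : doc;
    }

    // Write the action byte first, then the payload.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(action);
            out.writeUTF(doc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read the fields back in the same order they were written.
    static IndexAction fromBytes(byte[] bytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            byte action = in.readByte();
            String doc = in.readUTF();
            return new IndexAction(action, doc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something along these lines the reducer could branch on the action field rather than
testing the value for null, and the key would stay a plain URL.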

> Multiple deletes of the same URL using SolrClean
> ------------------------------------------------
>                 Key: NUTCH-1052
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Tim Pease
>            Assignee: Julien Nioche
>             Fix For: 1.4, 2.0
>         Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch, NUTCH-1052-1.4-3.patch
> The SolrClean class does not keep track of purged URLs; it only checks the URL status
> for "db_gone". When run multiple times, the same list of URLs will be deleted from Solr. For
> small, stable crawl databases this is not a problem. For larger crawls this could be an issue:
> SolrClean will become an expensive operation.
> One solution is to add a "purged" flag in the CrawlDatum metadata. SolrClean would then
> check this flag in addition to the "db_gone" status before adding the URL to the delete list.
> Another solution is to add a new state to the status field, "db_gone_and_purged".
> Either way, the crawl DB will need to be updated after the Solr delete has successfully

This message is automatically generated by JIRA.
For more information on JIRA, see:

