nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2214) Index clean to be flexible on what it deletes
Date Wed, 10 Feb 2016 11:17:18 GMT
Markus Jelsma created NUTCH-2214:
------------------------------------

             Summary: Index clean to be flexible on what it deletes
                 Key: NUTCH-2214
                 URL: https://issues.apache.org/jira/browse/NUTCH-2214
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.11
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.13


Nutch clean removes all useless records from the index. However, if Nutch is configured
correctly (-deleteGone etc.), the only useless records left in the index should be
duplicates, if there are any. On a large index, this could result in Nutch sending millions
of getById requests to Solr for records that do not exist in the first place.

This issue will make it configurable which records clean deletes, e.g. useless records
(404, 30x) or duplicates.
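
As a rough illustration of the idea, the selection could be sketched as below. This is
only a sketch under assumptions: the class name CleanSelection, the Category enum and the
comma-separated value format are all hypothetical, and none of it is taken from the Nutch
code base or from a patch for this issue.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

public class CleanSelection {

    /** Hypothetical record categories; these are not Nutch's CrawlDatum status codes. */
    enum Category { GONE, REDIRECT, DUPLICATE }

    /** Parses a comma-separated value such as "gone,redirect" into a set of categories. */
    static Set<Category> parse(String value) {
        Set<Category> selected = EnumSet.noneOf(Category.class);
        for (String token : value.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                // Throws IllegalArgumentException for an unknown category name.
                selected.add(Category.valueOf(name.toUpperCase(Locale.ROOT)));
            }
        }
        return selected;
    }

    /** Returns true only when the record's category was selected for deletion. */
    static boolean shouldDelete(Category category, Set<Category> selected) {
        return selected.contains(category);
    }

    public static void main(String[] args) {
        // Only duplicates are selected, so gone records trigger no request to the index.
        Set<Category> selected = parse("duplicate");
        System.out.println(shouldDelete(Category.DUPLICATE, selected));
        System.out.println(shouldDelete(Category.GONE, selected));
    }
}
```

With a selection like this, an index that was kept clean with -deleteGone would only see
requests for duplicates, rather than one request per gone or redirected record.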



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
