nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Created: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS
Date Sat, 14 Jul 2007 12:10:05 GMT
Indexer should only index pages with fetch status SUCCESS
---------------------------------------------------------

                 Key: NUTCH-514
                 URL: https://issues.apache.org/jira/browse/NUTCH-514
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Doğacan Güney
            Priority: Minor
             Fix For: 1.0.0


Currently, if you parse during fetch, Nutch only parses pages that were fetched successfully (i.e.,
those with status STATUS_FETCH_SUCCESS). But if you run parse as a separate job, Nutch also parses
pages such as "404 not found" or "301 moved" responses. Since most of these can be parsed successfully,
they are indexed and show up in search results.

IMO, we should either somehow mark contents so that a separate parse doesn't output non-STATUS_FETCH_SUCCESS
pages, or filter them out in the Indexer.
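
As a rough illustration of the second option (filtering in the indexer), here is a minimal,
self-contained sketch. It is not Nutch code: the FetchStatus enum, the Page class, and the
indexable() method are hypothetical stand-ins made up for this example, and a real fix would
have to work against Nutch's own status constants and indexing job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FetchStatusFilter {

    // Hypothetical stand-in for the per-page fetch status (not Nutch's real constants).
    enum FetchStatus { SUCCESS, NOT_FOUND, MOVED }

    // Hypothetical minimal page record: a URL plus the status recorded at fetch time.
    static final class Page {
        final String url;
        final FetchStatus status;

        Page(String url, FetchStatus status) {
            this.url = url;
            this.status = status;
        }
    }

    // Keep only pages whose fetch succeeded; everything else is skipped before indexing.
    static List<Page> indexable(List<Page> pages) {
        List<Page> kept = new ArrayList<>();
        for (Page page : pages) {
            if (page.status == FetchStatus.SUCCESS) {
                kept.add(page);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Page> fetched = Arrays.asList(
            new Page("http://example.com/", FetchStatus.SUCCESS),
            new Page("http://example.com/missing", FetchStatus.NOT_FOUND),
            new Page("http://example.com/old", FetchStatus.MOVED));

        // Print the URLs that would still be passed on to the indexer.
        for (Page page : indexable(fetched)) {
            System.out.println(page.url);
        }
    }
}
```

Filtering at index time leaves the separate parse job unchanged, at the cost of still parsing
pages that will never be indexed; marking contents earlier would avoid that extra work.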

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

