nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-613) Empty Summaries and Cached Pages
Date Tue, 19 Feb 2008 06:58:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570117#action_12570117 ]

Dennis Kubes commented on NUTCH-613:
------------------------------------

It seems to me that this code inside of the basic indexing filter is wrong and is what is
causing the problem:

    // url is both stored and indexed, so it's both searchable and returned
    doc.add(new Field("url",
                      reprUrlString == null ? urlString : reprUrlString,
                      Field.Store.YES, Field.Index.TOKENIZED));
    
    if (reprUrlString != null) {
      // also store original url as both stored and indexed
      doc.add(new Field("orig", urlString,
                        Field.Store.YES, Field.Index.TOKENIZED));
    }

OK, some background.  The fetcher goes to get page A, called sourceA, and gets redirected to
targetA.  Both sourceA and targetA are stored in the segments and the crawldb, but sourceA has
no parseText, parseData, or Content, only its crawl fetch entry.  TargetA has everything.
TargetA's metadata holds a reprURL that may point to itself, or to a different version of
itself due to normalization, but more likely points to its source, in this case sourceA.

Now we come to the indexer.  Here we add the reprURL, sourceA, as the url field and targetA as
the orig field.  Then, when getting the summary (before the patch), it looked up the url field,
sourceA, which has no parse objects and hence no summary and no content, so there was a null
pointer when trying to get the cached page.  IMO, url should point to targetA and orig should
point to sourceA, essentially flipped from what it is here.
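
To make the proposed flip concrete, here is a minimal sketch of the field assignment described above. It is not the attached patch and does not use the Lucene API: the class UrlFieldSketch and the method urlFields are hypothetical names, and a plain Map stands in for the index document so the example runs on its own. It assumes urlString is the fetched (target) url and reprUrlString is the representative url from the metadata, as in the filter code quoted earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; a Map stands in for the Lucene document.
public class UrlFieldSketch {

    // Returns the url-related fields in the flipped arrangement:
    // "url" always holds the url that actually has content (the target),
    // "orig" holds the representative url (usually the redirect source).
    static Map<String, String> urlFields(String urlString, String reprUrlString) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", urlString);
        if (reprUrlString != null) {
            fields.put("orig", reprUrlString);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder urls: sourceA redirected to targetA.
        Map<String, String> fields =
            urlFields("http://example.com/targetA", "http://example.com/sourceA");
        System.out.println(fields);
    }
}
```

With this arrangement, a lookup by the url field lands on targetA, which is the entry that has parse text, parse data, and content.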

> Empty Summaries and Cached Pages
> --------------------------------
>
>                 Key: NUTCH-613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-613
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, searcher, web gui
>    Affects Versions: 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 0.9.0, 1.0.0
>
>         Attachments: NUTCH-613-1-20080219.patch
>
>
> There is a bug where some search results do not have summaries and viewing their cached
> pages causes a NullPointer.  This bug is due to redirects getting stored under the new url
> and the getURL method of FetchedSegments getting the wrong (old) url which is stored in crawldb
> but has no content or parse objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

