nutch-dev mailing list archives

From "Joseph Chen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest
Date Tue, 18 Dec 2007 23:34:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935 ]

Joseph Chen commented on NUTCH-579:
-----------------------------------

I changed the db.signature.class setting, and this seems to solve the problem for an initial crawl.

Now I'm seeing a similar problem when I try to merge the results of two crawls.  I performed
two separate crawls using the crawl tool and then tried to merge their results.
Here are the steps I followed:

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduplicated the index
6) Merged the indexes

I noticed a problem after running the dedup.  My original index had about 8000 documents (corresponding
to feed posts) and after merging I ended up with about half that number (4000 documents).

Examining the index via Luke shows that I'm back down to one post per feed: each remaining
document has a unique digest value.
When I skip the dedup step (step 5), the number of documents is around 17000, and examining
this index shows multiple posts from a feed.
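
To illustrate why a shared digest would collapse a feed down to one post, here is a
minimal Java sketch of digest-based deduplication. It is not the actual DeleteDuplicates
logic. The Doc class, the dedupByDigest method, and the sample URLs and digest strings are
all hypothetical; the only assumption taken from the report is that documents sharing a
digest are treated as duplicates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DigestDedupSketch {

    // Hypothetical stand-in for an indexed document; only the two fields
    // relevant to deduplication are modeled.
    static final class Doc {
        final String url;
        final String digest;

        Doc(String url, String digest) {
            this.url = url;
            this.digest = digest;
        }
    }

    // Keeps the first document seen for each digest and drops the rest.
    // Assumption: a real dedup job may choose the survivor differently.
    static List<Doc> dedupByDigest(List<Doc> docs) {
        Map<String, Doc> survivors = new LinkedHashMap<>();
        for (Doc doc : docs) {
            survivors.putIfAbsent(doc.digest, doc);
        }
        return new ArrayList<>(survivors.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        // Three posts from one feed that all carry the same (made-up) digest.
        docs.add(new Doc("http://example.com/feed/post1", "aaaa"));
        docs.add(new Doc("http://example.com/feed/post2", "aaaa"));
        docs.add(new Doc("http://example.com/feed/post3", "aaaa"));
        // One post from a different feed with a different digest.
        docs.add(new Doc("http://example.org/feed/post1", "bbbb"));

        System.out.println(docs.size() + " documents before dedup");
        System.out.println(dedupByDigest(docs).size() + " documents after dedup");
    }
}
```

In this sketch the three posts that share a digest are reduced to a single survivor, which
matches the "one post per feed" symptom described above.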

I searched for the db.signature.class value in the DeleteDuplicates.java class, which is the
class that gets called when running bin/nutch dedup, but I didn't see any references to this
value.

Any ideas about this issue?

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason for this
> is that the digest, which is calculated based on the content (or the URL if the content
> is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the individual
> feed entries were being indexed properly but then when the dedup step ran, my merged index
> ended up with only one document.
> As a quick fix, I simply overrode the digest in the FeedIndexingFilter.java, by adding
> the following code to the filter function:
> byte[] signature = MD5Hash.digest(url.toString()).getDigest();
> doc.removeField("digest");
> doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));
> This seems to fix the issue as the index now contains the proper number of documents.
> Anyone have any comments on whether this is a good solution or if there is a better solution?
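
For readers without a Nutch checkout, the idea behind the quoted quick fix (deriving the
digest from the post URL rather than from the shared feed content) can be sketched with
the JDK alone. This is an illustration only: UrlDigestSketch and urlDigestHex are
hypothetical names, java.security.MessageDigest stands in for the MD5Hash and StringUtil
helpers used in the quoted code, and UTF-8 encoding of the URL is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlDigestSketch {

    // Returns the MD5 of the URL string as 32 lowercase hex characters.
    // Assumption: the URL is encoded as UTF-8 before hashing.
    static String urlDigestHex(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Two made-up post URLs from the same feed get different digests.
        System.out.println(urlDigestHex("http://example.com/feed/post1"));
        System.out.println(urlDigestHex("http://example.com/feed/post2"));
    }
}
```

Because each post has its own URL, each post gets its own digest here. The trade-off is
that two different URLs serving identical content would no longer be recognized as
duplicates by a digest comparison.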

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

