nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-371) DeleteDuplicates should remove documents with duplicate URLs
Date Tue, 03 Oct 2006 19:50:22 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-371?page=comments#action_12439643 ] 
            
Andrzej Bialecki  commented on NUTCH-371:
-----------------------------------------

I think we need to change DeleteDuplicates to implement the following algorithm:

Step 1: delete URL duplicates, keeping the most recent document

Step 2: delete content duplicates, keeping the one with the highest score (or optionally the
one with the shortest url?)

The order of these steps is important: first we need to make sure that we keep the most
recent versions of the pages - currently dedup removes by content hash first, which may delete
newer documents and keep older ones ... oops. The Indexer doesn't check this either - see NUTCH-378
for more details.

This requires storing fetchTime in the index, which automatically solves NUTCH-95.
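
Something like this is what I mean by storing fetchTime - just a sketch against Lucene's
Field and DateTools API, and the "fetchTime" field name and SECOND resolution are
placeholders, nothing is decided:

// Sketch only - not existing Nutch code. Stores the fetch time as a stored,
// untokenized field so that dedup can compare recency with a plain string
// comparison (DateTools strings sort lexicographically in time order).
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FetchTimeFieldSketch {

  /** Adds a sortable, stored fetch-time field to an index document. */
  public static void addFetchTime(Document doc, long fetchTime) {
    String value = DateTools.timeToString(fetchTime, DateTools.Resolution.SECOND);
    doc.add(new Field("fetchTime", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
  }
}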

The second step would keep the best-scoring pages and discard all others. Or perhaps we should
keep the ones with the shortest URLs?
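
Either way, the selection rule for step 2 is simple enough - roughly this, in plain Java
(the ScoredDoc holder and the pickSurvivor name are made up for illustration, they are not
Nutch classes):

// Sketch of the step 2 policy: among documents sharing a content hash, keep
// the highest-scoring one, falling back to the shortest URL on equal scores.
import java.util.List;

public class ContentDedupPolicySketch {

  static class ScoredDoc {
    final String url;
    final float score;
    ScoredDoc(String url, float score) { this.url = url; this.score = score; }
  }

  /** Picks the document to keep from a group with identical content hashes. */
  static ScoredDoc pickSurvivor(List<ScoredDoc> duplicates) {
    ScoredDoc best = duplicates.get(0);
    for (ScoredDoc d : duplicates) {
      if (d.score > best.score
          || (d.score == best.score && d.url.length() < best.url.length())) {
        best = d;
      }
    }
    return best;
  }
}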

Finally, we really, really need a JUnit test for this - I already started writing one, stay
tuned.
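
To give an idea of the kind of assertion I have in mind - this only exercises the "keep the
newest document per URL" rule against a tiny in-memory stand-in; all the names below are made
up, it is not a test of the real MapReduce job:

// JUnit 3 style sketch of the intended behaviour: after URL dedup only the
// most recently fetched document per URL survives. Everything here is an
// in-memory stand-in, not the real DeleteDuplicates job.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import junit.framework.TestCase;

public class TestUrlDedupRuleSketch extends TestCase {

  static class Doc {
    final String url;
    final long fetchTime;
    Doc(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
  }

  /** Keeps, for every URL, the document with the latest fetch time. */
  static List<Doc> dedupByUrl(List<Doc> docs) {
    Map<String, Doc> newest = new HashMap<String, Doc>();
    for (Doc d : docs) {
      Doc current = newest.get(d.url);
      if (current == null || d.fetchTime > current.fetchTime) {
        newest.put(d.url, d);
      }
    }
    return new ArrayList<Doc>(newest.values());
  }

  public void testMostRecentFetchSurvives() {
    Doc older = new Doc("http://example.com/page", 1000L);
    Doc newer = new Doc("http://example.com/page", 2000L);
    List<Doc> kept = dedupByUrl(Arrays.asList(older, newer));
    assertEquals(1, kept.size());
    assertSame(newer, kept.get(0));
  }
}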

> DeleteDuplicates should remove documents with duplicate URLs
> ------------------------------------------------------------
>
>                 Key: NUTCH-371
>                 URL: http://issues.apache.org/jira/browse/NUTCH-371
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Chris Schneider
>
> DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting
> documents with identical MD5 hashes), but this part is apparently not yet implemented. Here's
> the comment from DeleteDuplicates.java:
> // 2. map indexes -> <<url, fetchdate>, <index,doc>>
> // partition by url
> // reduce, deleting all but most recent.
> //
> // Part 2 is not yet implemented, but the Indexer currently only indexes one
> // URL per page, so this is not a critical problem.
> It is apparently also known that re-fetching the same URL (e.g., one month later) will
> result in more than one document with the same URL (this is alluded to in NUTCH-95), but the
> comment above suggests that the indexer will solve the problem before DeleteDuplicates, because
> it will only index one document per URL.
> This is not necessarily the case if the segments are to be divided among search servers,
> as each server will have its own index built from its own portion of the segments. Thus, if
> the URL in question was fetched in different segments, and these segments end up assigned
> to different search servers, then the indexer can't be relied on to eliminate the duplicates.
> Thus, it seems like the second part of the DeleteDuplicates algorithm (i.e., deleting
> documents with duplicate URLs) needs to be implemented. I agree with Byron and Andrzej that
> the most recently fetched document (rather than the one with the highest score) should be
> preserved.
> Finally, it's also possible to get duplicate URLs in the segments without re-fetching
> an expired URL in the crawldb. This can happen if 3 different URLs all redirect to the same
> target URL. This is yet another consequence of handling redirections immediately, rather than
> adding the target URL to the crawldb for fetching in some subsequent segment (see NUTCH-273).
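
For reference, the "partition by url" part of the plan quoted above could look roughly like
the following. This is a sketch against the generic form of the old
org.apache.hadoop.mapred.Partitioner interface (an assumption - the exact interface in use may
differ), and the "url<TAB>fetchdate" key encoding is invented for illustration; the real job
would use a proper composite key.

// Sketch only: partition on the URL part of the key and ignore the fetch
// date, so that every record for a given URL reaches the same reducer, where
// all but the most recent can be deleted.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class UrlOnlyPartitionerSketch implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Writable value, int numPartitions) {
    // Key is assumed (for this sketch) to be "url<TAB>fetchdate".
    String composite = key.toString();
    int tab = composite.indexOf('\t');
    String url = tab < 0 ? composite : composite.substring(0, tab);
    return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}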

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
