nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-656) DeleteDuplicates based on crawlDB only
Date Sun, 29 Sep 2013 14:25:24 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781389#comment-13781389 ]

Sebastian Nagel commented on NUTCH-656:
---------------------------------------

Hi Julien, hi Markus,

regarding robustness: what happens if, in a continuous crawl, two duplicate documents swap their
order regarding score? Previously, A had a higher score than B, and consequently B was removed
from the index. Now B gets the higher score, and DeduplicationJob will remove A from the index.
The current solr-dedup is immune because in the second call only A is retrieved from Solr and
there is no need for deduplication.
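
Just to make the flip concrete, a toy illustration in plain Java (no Nutch classes; the
keep-the-higher-score rule only stands in for whatever comparison the job actually uses):

{code:java}
// Toy illustration of the ordering problem: if the scores of two duplicates
// swap between crawl cycles, the URL chosen for deletion swaps as well.
public class DedupFlipExample {

    /** Return the URL a score-based dedup would delete (the lower-scored one). */
    static String toDelete(String urlA, float scoreA, String urlB, float scoreB) {
        return scoreA >= scoreB ? urlB : urlA;
    }

    public static void main(String[] args) {
        // Cycle 1: A outranks B, so B is deleted from the index.
        System.out.println(toDelete("http://example.com/A", 1.2f, "http://example.com/B", 0.8f));
        // Cycle 2: B now outranks A, so A is deleted -- but B is no longer indexed.
        System.out.println(toDelete("http://example.com/A", 0.7f, "http://example.com/B", 0.9f));
    }
}
{code}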

For crawlDb-based deduplication, deduplicated docs/URLs must be flagged in the CrawlDb so that
the index status is reflected in the CrawlDb. Deduplication jobs can then base decisions on
previous deduplications/deletions. Also, status changes from "duplicate" to "not modified"
could be treated in a safe way by forcing a re-index (and a re-fetch if required).
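
Roughly, such flagging could look like the sketch below; the metadata key name "_isduplicate_"
is made up for illustration, the actual mechanism (metadata key or status) is still open:

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DedupFlagging {

    // Hypothetical metadata key; any agreed-upon name would do.
    private static final Text DUPLICATE_KEY = new Text("_isduplicate_");

    /** Record in the CrawlDb that this URL's document was deduplicated/deleted. */
    static void markDuplicate(CrawlDatum datum) {
        datum.getMetaData().put(DUPLICATE_KEY, new Text("true"));
    }

    /** Later jobs (indexing, a second dedup pass) can check the flag. */
    static boolean isDuplicate(CrawlDatum datum) {
        return datum.getMetaData().containsKey(DUPLICATE_KEY);
    }
}
{code}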

After duplicates are flagged in the CrawlDb, deletion of duplicates could then be done by indexing
jobs. Indexing backends (e.g. CSV) which cannot really delete documents (deletion means not
indexing them) would also benefit. In addition, the re-fetch scheduling of duplicate docs could
be lowered in priority by a scoring filter.
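
For the re-fetch scheduling part, the natural place would be a scoring filter's generator hook;
below only the core logic as a standalone sketch, reusing the hypothetical flag from above and
an arbitrary demotion factor:

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DuplicateDemotion {

    // Same made-up metadata key as in the flagging sketch above.
    private static final Text DUPLICATE_KEY = new Text("_isduplicate_");

    /** Demote the generator sort value of URLs already marked as duplicates,
     *  so that they are re-fetched with lower priority. */
    static float generatorSortValue(CrawlDatum datum, float initSort) {
        boolean duplicate = datum.getMetaData().containsKey(DUPLICATE_KEY);
        return duplicate ? initSort * 0.1f : initSort; // 0.1f chosen arbitrarily
    }
}
{code}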

Flagging is possible in the CrawlDatum's metadata. But a new db status DUPLICATE, although a
significant change, would be more explicit and efficient. It would also make it simpler to combine
various dedup jobs, e.g. first by canonical links (NUTCH-710), then by signature: docs already
of status DUPLICATE need no second (possibly contradicting) deduplication.
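
If we went for a dedicated status, the skip logic for combining dedup jobs would be trivial;
the constant below is only a placeholder, it does not exist in CrawlDatum today:

{code:java}
import org.apache.nutch.crawl.CrawlDatum;

public class DuplicateStatus {

    // Placeholder for a proposed new CrawlDatum status byte; value chosen arbitrarily.
    static final byte STATUS_DB_DUPLICATE = 0x07;

    /** A second dedup job (e.g. by signature) skips docs a first job
     *  (e.g. by canonical link, NUTCH-710) already marked as duplicates. */
    static boolean needsDeduplication(CrawlDatum datum) {
        return datum.getStatus() != STATUS_DB_DUPLICATE;
    }
}
{code}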


> DeleteDuplicates based on crawlDB only 
> ---------------------------------------
>
>                 Key: NUTCH-656
>                 URL: https://issues.apache.org/jira/browse/NUTCH-656
>             Project: Nutch
>          Issue Type: Wish
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-656.patch
>
>
> The existing dedup functionality relies on Lucene indices and can't be used when the
> indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead to detect
> URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the
> content of the crawlDB contains all the elements we need for dedup, namely:
> * URL 
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * read crawlDB and compare on URLs: keep only the most recent element - oldest are stored
> in a file and will be deleted later
> * read crawlDB and have a map function generating signatures as keys and URL + fetch
> time + score as value
> * reduce function would depend on which parameter is set (i.e. use signature or score)
> and would output a list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the indices.
> Any thoughts on this? Am I missing something?
> Julien
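
For reference, a rough sketch of the signature-based job described above (not the attached
patch; job driver and input/output setup omitted, all class and method names illustrative):
the mapper keys CrawlDb entries by signature, the reducer keeps the entry with the highest
score (most recent fetch as tie-breaker) and emits the remaining URLs for deletion.

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.nutch.crawl.CrawlDatum;

public class DedupBySignatureSketch {

  /** Map: signature (hex) -> "url \t fetchTime \t score". */
  public static class DedupMapper extends Mapper<Text, CrawlDatum, Text, Text> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
        throws IOException, InterruptedException {
      byte[] sig = datum.getSignature();
      if (sig == null) return; // unfetched entry, nothing to deduplicate
      context.write(new Text(toHex(sig)),
          new Text(url + "\t" + datum.getFetchTime() + "\t" + datum.getScore()));
    }
  }

  /** Reduce: keep the best entry per signature, emit all others as URLs to delete. */
  public static class DedupReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text signature, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String bestUrl = null;
      long bestTime = -1L;
      float bestScore = -1f;
      for (Text value : values) {
        String[] parts = value.toString().split("\t");
        String url = parts[0];
        long time = Long.parseLong(parts[1]);
        float score = Float.parseFloat(parts[2]);
        boolean better = score > bestScore || (score == bestScore && time > bestTime);
        if (bestUrl == null || better) {
          if (bestUrl != null) {
            context.write(new Text(bestUrl), NullWritable.get()); // previous best loses
          }
          bestUrl = url;
          bestTime = time;
          bestScore = score;
        } else {
          context.write(new Text(url), NullWritable.get()); // duplicate to delete
        }
      }
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
{code}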



--
This message was sent by Atlassian JIRA
(v6.1#6144)
