nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (NUTCH-656) DeleteDuplicates based on crawlDB only
Date Wed, 25 Sep 2013 10:30:02 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reopened NUTCH-656:
---------------------------------

      Assignee: Julien Nioche

I am reopening this issue because we need a generic deduplicator now that NUTCH-1047 has been
committed. The SOLR dedup worked but was not very efficient, as it required pulling all the
documents from SOLR and using them as input to the MapReduce code. The attached patch implements
a generic deduplicator that can be used with any indexing backend (e.g. elasticsearch, cloudsearch,
but also SOLR). The main differences from the SOLR-specific code are that the score is taken
from the crawldb rather than from the index, and that entries are currently deleted only from
the index, not from the crawldb.
Andrzej's point about having duplicate entries in the segment is still relevant, but in practice
documents with the same URL (i.e. unique in the crawldb) overwrite each other when indexed,
so they are already deduplicated in a way.


                
> DeleteDuplicates based on crawlDB only 
> ---------------------------------------
>
>                 Key: NUTCH-656
>                 URL: https://issues.apache.org/jira/browse/NUTCH-656
>             Project: Nutch
>          Issue Type: Wish
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-656.patch
>
>
> The existing dedup functionality relies on Lucene indices and can't be used when the
> indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead to detect
> URLs to delete then do the deletions in an indexer-neutral way. As far as I understand the
> content of the crawlDB contains all the elements we need for dedup, namely :
> * URL 
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs : 
> * read crawlDB and compare on URLs : keep only the most recent element - older ones are
>   stored in a file and will be deleted later
> * read crawlDB and have a map function generating signatures as keys and URL + fetch
>   time + score as value
> * reduce function would depend on which parameter is set (i.e. use signature or score)
>   and would output a list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the indices.
> Any thoughts on this? Am I missing something?
> Julien
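
The signature-keyed map and reduce steps described in the quoted proposal can be sketched roughly as follows. This is only an in-memory Python illustration, not the attached patch and not Nutch's API: `CrawlEntry`, `map_by_signature`, `reduce_duplicates`, `urls_to_delete` and the `prefer` parameter are all hypothetical names, and the tie-breaking rule (score first, then fetch time, or the reverse) is an assumption.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class CrawlEntry(NamedTuple):
    """Hypothetical stand-in for a crawlDB record (not Nutch's own class)."""
    url: str
    signature: str
    fetch_time: int  # e.g. epoch seconds
    score: float


def map_by_signature(entries: Iterable[CrawlEntry]) -> Iterator[Tuple[str, CrawlEntry]]:
    """Map step: emit (signature, entry) so that duplicates share a key."""
    for entry in entries:
        yield entry.signature, entry


def reduce_duplicates(group: List[CrawlEntry], prefer: str = "score") -> List[str]:
    """Reduce step: keep one entry per signature, return the URLs of the rest.

    `prefer` is an assumed switch: "score" keeps the highest-scoring entry,
    anything else keeps the most recently fetched one.
    """
    if prefer == "score":
        rank = lambda e: (e.score, e.fetch_time)
    else:
        rank = lambda e: (e.fetch_time, e.score)
    keeper = max(group, key=rank)
    return sorted(e.url for e in group if e.url != keeper.url)


def urls_to_delete(entries: Iterable[CrawlEntry], prefer: str = "score") -> List[str]:
    """Group the mapped pairs by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for signature, entry in map_by_signature(entries):
        groups[signature].append(entry)
    doomed: List[str] = []
    for signature in sorted(groups):
        doomed.extend(reduce_duplicates(groups[signature], prefer))
    return doomed


entries = [
    CrawlEntry("http://a.example/", "sig1", 100, 1.0),
    CrawlEntry("http://b.example/", "sig1", 200, 0.5),
    CrawlEntry("http://c.example/", "sig2", 150, 0.2),
]
print(urls_to_delete(entries, prefer="score"))
print(urls_to_delete(entries, prefer="fetch_time"))
```

The resulting list of URLs would then be handed to whichever indexing backend is in use, which matches the assumption in the proposal that URLs identify documents in the index.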

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
