nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1324) DupeDB for Nutch
Date Tue, 12 Nov 2013 09:26:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819947#comment-13819947 ]

Markus Jelsma commented on NUTCH-1324:
--------------------------------------

Hi Julien, no, this is something else. The DupeDB is a <DupeDatum,Text> database where
the DupeDatum is a compound type made up of the digest, the URL path section and the domain.
The Text is the host part of the URL. The DupeDB is generated by reading the CrawlDB. It is
then ingested by NUTCH-1326 together with NUTCH-1325 to output rules for NUTCH-1319.
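
To make the entry layout concrete, here is a minimal Python sketch of how one such key/value
pair could be derived from a URL and its content digest. It is an illustration only, not the
code in the patch: the names `DupeDatum`, `naive_domain` and `make_entry` are hypothetical, and
the domain is naively taken as the last two labels of the host, which would be wrong for
suffixes such as co.uk.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical stand-in for the compound key: digest, URL path section, domain.
DupeDatum = namedtuple("DupeDatum", ["digest", "path", "domain"])


def naive_domain(host):
    """Assumption: the domain is the last two labels of the host.

    A real implementation would need a public suffix list.
    """
    return ".".join(host.split(".")[-2:])


def make_entry(url, digest):
    """Build one (DupeDatum, host) pair from a URL and its content digest."""
    parts = urlsplit(url)
    host = parts.hostname
    key = DupeDatum(digest, parts.path or "/", naive_domain(host))
    return key, host


key, host = make_entry(
    "http://www.annuaire-loisirs-seniors.fr/grande-bretagne-c-248.html",
    "a218daf4a39ed75b24d977bb90394a11",
)
```

The host is kept out of the key on purpose, so that the same page served from two hosts of
one domain produces two entries with an identical key.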

All these things are for solving the duplicate host problem in the CrawlDB by using a HostNormalizer.
We crawled the internet (without filtering rules) for over a year. We quickly saw the fetcher
fetching the same pages from the same domains over and over. The most typical host duplication
is a website accessible over both http://www.example.org/ and http://example.org/. This means twice
as many unique URLs for many domains. You cannot use manual URL filters to solve the problem,
nor can you manually edit the HostNormalizer on this scale.

These tools generate the host normalization rules automatically.

Here's an example of two DupeDB entries for the common www-problem. The first three columns
make up the DupeDatum, which is the key in M/R; the rightmost column is the host:
a218daf4a39ed75b24d977bb90394a11        /grande-bretagne-c-248.html     annuaire-loisirs-seniors.fr     annuaire-loisirs-seniors.fr
a218daf4a39ed75b24d977bb90394a11        /grande-bretagne-c-248.html     annuaire-loisirs-seniors.fr     www.annuaire-loisirs-seniors.fr

Here's a more interesting problem:
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz znacky.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz siku-farmer.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz impag.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz koleje.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz lifetime.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz penove-dekorace.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz grand.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz maxi.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz groovy-pets.katalog-hracek.cz
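
Because the DupeDatum is the key, all hosts serving the same content under the same path and
domain end up in one group. The Python sketch below mimics that grouping in memory with a few
of the entries shown above. It only illustrates the idea and does not claim to reflect how
NUTCH-1326 processes the DupeDB; `group_hosts` is a hypothetical helper.

```python
from collections import defaultdict

# ((digest, path, domain), host) pairs copied from the entries above.
entries = [
    (("a218daf4a39ed75b24d977bb90394a11", "/grande-bretagne-c-248.html",
      "annuaire-loisirs-seniors.fr"), "annuaire-loisirs-seniors.fr"),
    (("a218daf4a39ed75b24d977bb90394a11", "/grande-bretagne-c-248.html",
      "annuaire-loisirs-seniors.fr"), "www.annuaire-loisirs-seniors.fr"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/", "katalog-hracek.cz"),
     "znacky.katalog-hracek.cz"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/", "katalog-hracek.cz"),
     "grand.katalog-hracek.cz"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/", "katalog-hracek.cz"),
     "maxi.katalog-hracek.cz"),
]


def group_hosts(pairs):
    """Collect hosts per key and keep only keys seen on more than one host."""
    hosts_by_key = defaultdict(set)
    for key, host in pairs:
        hosts_by_key[key].add(host)
    return {
        key: sorted(hosts)
        for key, hosts in hosts_by_key.items()
        if len(hosts) > 1
    }


duplicates = group_hosts(entries)
for (digest, path, domain), hosts in duplicates.items():
    print(domain, path, hosts)
```

Each resulting group is a set of candidate duplicate hosts for one domain; deciding which
host should become the canonical one is a separate step that this sketch leaves out.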



> DupeDB for Nutch
> ----------------
>
>                 Key: NUTCH-1324
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1324
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>
> A DupeDB for Nutch and associated tools to create and read a database containing information
> on duplicates.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
