nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] Commented: (NUTCH-710) Support for rel="canonical" attribute
Date Wed, 21 Apr 2010 10:02:52 GMT


Julien Nioche commented on NUTCH-710:

As suggested previously, we could either treat canonicals as redirections or handle them during deduplication.
Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical is not available
for indexing. We also want to follow the outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex due to the fact
that we need to follow redirections.

We probably need a third approach: prefilter by going through the crawlDB and detecting URLs
whose canonical target is already indexed or ready to be indexed. We need to follow up
to X levels of redirection, e.g. doc A is marked as a canonical representation of doc B, doc B redirects
to doc C, etc. If the end of the redirection chain exists and is valid, then mark A as a duplicate of
C (intermediate redirects will not get indexed anyway).

As we don't know whether it has been indexed yet, we would give it a special marker (e.g. status_duplicate)
in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *DeleteDuplicates can take a list of URLs with status_duplicate as an additional
source of input, OR have a custom resource that deletes such entries from the Solr or Lucene indices
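The two points above can be sketched as a simple filter over crawlDB entries. This is an illustrative Python sketch, not actual Nutch code (Nutch is Java); the entry tuples and the STATUS_DUPLICATE marker name are assumptions.

```python
# Hypothetical sketch: skip crawlDB entries marked status_duplicate at
# indexing time. Entry shape (url, status) and the marker name are
# illustrative assumptions, not real Nutch structures.
STATUS_DUPLICATE = "status_duplicate"

def entries_to_index(crawldb_entries):
    """Yield only the entries the indexer should process."""
    for url, status in crawldb_entries:
        if status == STATUS_DUPLICATE:
            continue  # duplicate of an already/soon-to-be indexed canonical target
        yield url, status

def duplicate_urls(crawldb_entries):
    """Collect the URLs flagged as duplicates, e.g. as an extra input
    for *DeleteDuplicates or for a custom index-cleanup job."""
    return [url for url, status in crawldb_entries
            if status == STATUS_DUPLICATE]
```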

The implementation would be as follows:

Go through all redirections and generate all redirection chains, e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed; it may already have been indexed)

will yield

A -> C
B -> C
D -> C

but also

C -> C
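Collapsing the chains above amounts to following each URL to the end of its redirect chain, with a hop limit (the "X levels" mentioned earlier) and a cycle check. A minimal Python sketch, assuming the redirects are available as a simple URL-to-URL map (they would really come out of the crawlDB):

```python
# Illustrative sketch: resolve every URL in a redirect graph to the end of
# its chain. The dict input and max_depth parameter are assumptions for the
# example, not Nutch APIs.
def resolve_chains(redirects, max_depth=5):
    """redirects: e.g. {"A": "B", "B": "C", "D": "C"}.
    Returns a map of every URL to its chain end, including the
    self-alias for chain ends (C -> C)."""
    resolved = {}
    urls = set(redirects) | set(redirects.values())
    for url in urls:
        current, hops, seen = url, 0, {url}
        while current in redirects and hops < max_depth:
            current = redirects[current]
            if current in seen:
                current = None  # redirect loop: no valid end of chain
                break
            seen.add(current)
            hops += 1
        if current is not None:
            resolved[url] = current
    return resolved
```

For the example chains, this yields A -> C, B -> C, D -> C and also the self-alias C -> C; a loop such as A -> B -> A produces no alias at all, matching the "end of redirection chain exists and is valid" condition.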

Once we have all possible redirections, go through the crawlDB in search of canonicals. If
the target of a canonical is the source of a valid alias (e.g. A - B - C - D), mark it as 'status:duplicate'.
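This second pass can be sketched as a join between the canonical annotations and the resolved aliases. Again a hedged Python illustration: the canonical_of and alias_of maps are assumed inputs (in practice both would come from crawlDB scans), not real Nutch fields.

```python
# Illustrative sketch of the second pass: flag crawlDB entries whose
# canonical target resolves, via the alias map, to a valid indexable
# document. All names here are hypothetical.
def mark_duplicates(crawldb_urls, canonical_of, alias_of):
    """crawldb_urls: iterable of URLs in the crawlDB.
    canonical_of: url -> target of its rel="canonical" link, if any.
    alias_of: resolved redirect map (source of a valid alias -> chain end).
    Returns the set of URLs to mark as status:duplicate."""
    duplicates = set()
    for url in crawldb_urls:
        target = canonical_of.get(url)
        if target is None:
            continue  # page carries no rel="canonical"
        final = alias_of.get(target)
        if final is not None and final != url:
            duplicates.add(url)  # canonical chain ends at a valid document
    return duplicates
```

E.g. with A carrying a canonical pointing at B, and B aliased to the indexable document C, A gets marked as a duplicate while C itself is left untouched.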

This design implies generating quite a few intermediate structures, scanning the whole crawlDB
twice (once for the aliases, then for the canonicals), and rewriting the whole crawlDB to mark some
of the entries as duplicates.

This would be much easier to do once we have Nutch2/HBase: we could simply follow the redirects
from the initial URL having a canonical tag instead of generating these intermediate structures.
We could then modify the entries one by one instead of regenerating the whole crawlDB.


> Support for rel="canonical" attribute
> -------------------------------------
>                 Key: NUTCH-710
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> Adding support for this attribute value will potentially reduce the number of URLs crawled
> and indexed and reduce duplicate page content.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
