nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-710) Support for rel="canonical" attribute
Date Wed, 21 Apr 2010 10:02:52 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
] 

Julien Nioche commented on NUTCH-710:
-------------------------------------

As suggested previously, we could handle canonicals either as redirections or during deduplication.
Neither is a satisfactory solution.

Redirection: we still want to index the document if/when the target of the canonical is not
available for indexing. We also want to follow its outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex because
we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb and detecting URLs
whose canonical target is already indexed or ready to be indexed. We need to follow up
to X levels of redirection, e.g. doc A declares doc B as its canonical representation, doc B
redirects to doc C, etc. If the end of the redirection chain exists and is valid, then mark A
as a duplicate of C (intermediate redirs will not get indexed anyway).
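
To make the "follow up to X levels" idea concrete, here is a rough Python sketch (not Nutch code). It assumes the canonical and redirect links are available as a plain dict and the indexable URLs as a set; the names `resolve_chain_end`, `links`, `indexable` and `max_hops` are made up for illustration.

```python
from typing import Dict, Optional, Set


def resolve_chain_end(start: str,
                      links: Dict[str, str],
                      indexable: Set[str],
                      max_hops: int = 3) -> Optional[str]:
    """Follow canonical/redirect links from `start` for at most `max_hops` hops.

    Returns the URL at the end of the chain if it is a valid, indexable
    document different from `start`; returns None otherwise.
    """
    current = start
    seen = {start}
    for _ in range(max_hops):
        nxt = links.get(current)
        if nxt is None:
            break               # reached the end of the chain
        if nxt in seen:
            return None         # loop (includes a self-referencing canonical)
        seen.add(nxt)
        current = nxt
    # Reject: no link at all, chain longer than max_hops, or end not indexable.
    if current == start or current in links or current not in indexable:
        return None
    return current


# A has a canonical pointing to B, B redirects to C, C is indexable.
links = {"A": "B", "B": "C"}
print(resolve_chain_end("A", links, {"C"}))  # C
```

With `max_hops=1` the same call returns None, since the chain A -> B -> C is longer than allowed.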

As we don't know whether the duplicate has been indexed yet, we would give it a special marker
(e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *deleteDuplicates can take a list of URLs with status_duplicate as an additional
source of input, OR have a custom resource that deletes such entries in SOLR or Lucene indices
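
A minimal sketch of how the marker could be consumed, assuming the crawlDB is modelled as a simple URL -> status dict. The function name `split_entries` and the status strings are illustrative only, not the actual Nutch data model.

```python
from typing import Dict, List, Tuple

STATUS_DUPLICATE = "status_duplicate"  # marker name proposed above


def split_entries(crawldb: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (urls_to_index, urls_to_delete) for a URL -> status mapping."""
    to_index: List[str] = []
    to_delete: List[str] = []
    for url, status in sorted(crawldb.items()):
        if status == STATUS_DUPLICATE:
            # Skipped by the indexer; handed to the deletion step instead.
            to_delete.append(url)
        else:
            to_index.append(url)
    return to_index, to_delete


crawldb = {"A": STATUS_DUPLICATE, "C": "fetched"}
to_index, to_delete = split_entries(crawldb)
print(to_index, to_delete)  # ['C'] ['A']
```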

The implementation would be as follows :

Go through all redirections and generate all redirection chains e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. has been fetched and parsed - it may have already
been indexed)

will yield

A -> C
B -> C
D -> C

but also

C -> C
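
The flattening step above could be sketched roughly as follows in Python. This is an in-memory illustration only (the real thing would be a job over the crawlDB); `flatten_redirects`, `redirects`, `indexable` and `max_hops` are hypothetical names.

```python
from typing import Dict, Set


def flatten_redirects(redirects: Dict[str, str],
                      indexable: Set[str],
                      max_hops: int = 5) -> Dict[str, str]:
    """Map every alias to the indexable document at the end of its chain."""
    # Indexable documents that do not redirect anywhere are chain ends.
    terminals = {url for url in indexable if url not in redirects}
    # Each chain end is an alias of itself (the "C -> C" entry).
    aliases = {url: url for url in terminals}
    for source in redirects:
        current = source
        visited = {source}
        hops = 0
        while current in redirects and hops < max_hops:
            current = redirects[current]
            hops += 1
            if current in visited:
                break           # redirect loop, no valid end
            visited.add(current)
        if current in terminals:
            aliases[source] = current
    return aliases


redirects = {"A": "B", "B": "C", "D": "C"}
print(flatten_redirects(redirects, {"C"}))
# {'C': 'C', 'A': 'C', 'B': 'C', 'D': 'C'}
```

Chains that loop, exceed `max_hops`, or end on a non-indexable URL simply produce no alias entry.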

Once we have all possible redirections, go through the crawlDB in search of canonicals. If
the target of a canonical is the source of a valid alias (e.g. A, B, C or D), mark the URL
carrying the canonical as 'status_duplicate'.
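
A rough sketch of this marking pass, assuming each crawlDB entry is a small dict with an optional "canonical" field and that the alias table is a plain alias -> final URL dict. The field names and `mark_duplicates` are assumptions made for illustration, not existing Nutch structures.

```python
from typing import Dict

STATUS_DUPLICATE = "status_duplicate"  # marker name proposed earlier


def mark_duplicates(crawldb: Dict[str, dict], aliases: Dict[str, str]) -> int:
    """Mark entries whose canonical target resolves to another valid document.

    Returns the number of entries marked.
    """
    marked = 0
    for url, entry in crawldb.items():
        target = entry.get("canonical")
        if target is None:
            continue            # no canonical declared
        final = aliases.get(target)
        if final is None or final == url:
            continue            # target not valid, or the page is its own canonical
        entry["status"] = STATUS_DUPLICATE
        entry["duplicate_of"] = final
        marked += 1
    return marked


aliases = {"A": "C", "B": "C", "C": "C", "D": "C"}
crawldb = {
    "E": {"status": "fetched", "canonical": "A"},  # canonical resolves to C
    "C": {"status": "fetched", "canonical": "C"},  # self canonical, kept
    "F": {"status": "fetched", "canonical": "Z"},  # unknown target, kept
}
print(mark_duplicates(crawldb, aliases))  # 1
print(crawldb["E"]["status"])             # status_duplicate
```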

This design implies generating quite a few intermediate structures, scanning the whole crawlDB
twice (once for the aliases, then for the canonicals), and rewriting the whole crawlDB to mark
some of the entries as duplicates.

This would be much easier to do once we have Nutch2/HBase: we could simply follow the redirs
from the initial URL having a canonical tag instead of generating these intermediate structures.
We could then modify the entries one by one instead of regenerating the whole crawlDB.

WDYT?



> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of URLs crawled
> and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

