nutch-dev mailing list archives

From "Ferdy Galema (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x
Date Mon, 07 Jan 2013 10:48:15 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545766#comment-13545766 ]

Ferdy Galema commented on NUTCH-1508:
-------------------------------------

NUTCH-1431 (aka the 'distance' concept) only defines a global limit. However, for an internal
branch I created a hack that allows the limit to be specified on a per-host basis using the
host table. It is not very clean.

I think NUTCH-1331 is the better approach, because it is less intrusive and because it lets
us define a scoring for depth-exceeding URLs instead of ignoring them. (It also keeps the
differences between 1.x and 2.x to a minimum.) So once this is implemented for 2.x, we can
throw away the changes from NUTCH-1431.
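
To make the contrast above concrete, here is a rough, language-neutral sketch in Python. It is
not Nutch code and does not reflect the actual implementation of either issue. All names
(DEFAULT_MAX_DEPTH, PER_HOST_MAX_DEPTH, max_depth_for, filter_by_depth, score_by_depth) and the
decay formula are hypothetical and chosen only for illustration. It shows a global limit with an
optional per-host override, and the difference between dropping depth-exceeding URLs and merely
scoring them lower.

```python
from urllib.parse import urlparse

# Hypothetical global limit, analogous to a single configured maximum.
DEFAULT_MAX_DEPTH = 3

# Hypothetical per-host overrides, standing in for values kept in a host table.
PER_HOST_MAX_DEPTH = {"example.org": 1}


def max_depth_for(url):
    """Return the per-host limit if one is set, else the global limit."""
    host = urlparse(url).hostname or ""
    return PER_HOST_MAX_DEPTH.get(host, DEFAULT_MAX_DEPTH)


def filter_by_depth(url, depth):
    """Hard cutoff: a URL beyond the limit is ignored entirely."""
    return depth <= max_depth_for(url)


def score_by_depth(url, depth, base_score=1.0):
    """Soft alternative: keep the URL, but lower its score past the limit.

    The decay used here is an assumption made for this sketch only.
    """
    limit = max_depth_for(url)
    if depth <= limit:
        return base_score
    return base_score / (1 + depth - limit)
```

With the scoring variant, a deep URL is still known to the crawler and can be fetched later if
nothing better is available, whereas the filtering variant discards it outright.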
                
> Port limit crawler to defined depth to 2.x
> ------------------------------------------
>
>                 Key: NUTCH-1508
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1508
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Julien Nioche
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
