nutch-dev mailing list archives

From "Ferdy Galema (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator
Date Wed, 18 Jul 2012 10:23:33 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1431:
--------------------------------

    Attachment: NUTCH-1431.patch
    
> Introduce link 'distance' and add configurable max distance in the generator
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1431
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1431
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1431.patch
>
>
> Introducing a new feature that enables crawling URLs within a specific distance (shortest
path) from the injected source urls. This is where the db-updater of Nutchgora really shines.
Because every url in the reducer has all of its inlinks present, it is easy to determine
the shortest path to that url. (I would not know how to cleanly implement this feature
for trunk).
> Injected urls have distance 0. Outlink urls on those pages have distance 1. Outlinks
on those pages have distance 2, etc. Outlinks that already had a smaller distance will keep
that distance. Of all inlinks to a page, the smallest distance is always selected in order
to maintain the shortest-path guarantee.
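
The distance rule described above can be illustrated with a minimal, standalone sketch. This is not the code from NUTCH-1431.patch: the class name `DistanceSketch`, the method `updateDistance`, and the `UNKNOWN` marker are hypothetical names chosen for illustration, and plain integers stand in for whatever the db-updater reducer actually stores per url.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the shortest-distance rule; not the actual patch.
final class DistanceSketch {
    // Assumption of this sketch: -1 marks a page with no known distance yet.
    static final int UNKNOWN = -1;

    /**
     * Returns the distance a page should keep after an update round.
     *
     * @param current         the page's stored distance (0 for injected urls, UNKNOWN if unset)
     * @param inlinkDistances distances of all pages that link to this page
     */
    static int updateDistance(int current, List<Integer> inlinkDistances) {
        int best = current;
        for (int inlinkDistance : inlinkDistances) {
            if (inlinkDistance == UNKNOWN) {
                continue; // an inlink without a distance cannot contribute a path
            }
            int candidate = inlinkDistance + 1; // one hop further than the linking page
            if (best == UNKNOWN || candidate < best) {
                best = candidate; // keep only the smallest distance seen
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // New page linked from an injected url (0) and from a page at distance 3.
        System.out.println(updateDistance(UNKNOWN, Arrays.asList(0, 3))); // prints 1
        // Page that already has a smaller distance keeps it.
        System.out.println(updateDistance(2, Arrays.asList(5)));          // prints 2
        // Injected url stays at 0 even if other pages link back to it.
        System.out.println(updateDistance(0, Arrays.asList(1)));          // prints 0
    }
}
```

Because the reducer sees all inlinks of a url at once, a single pass over them is enough to pick the minimum; no separate graph traversal is needed in this sketch.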
> Generator now has a property 'generate.max.distance' (default set to -1) that specifies
the maximum allowed distance of urls to select for fetch.
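
As a rough sketch of how such a limit could be applied at selection time: the class `MaxDistanceFilterSketch` and the method `shouldGenerate` below are hypothetical, not taken from the patch. The sketch assumes that a negative value such as the default -1 means "no limit" (consistent with the statement that crawl behaviour is unchanged unless configured) and that the comparison is inclusive; neither detail is confirmed by this message.

```java
// Hypothetical illustration of a max-distance check in the generator; not the actual patch.
final class MaxDistanceFilterSketch {
    /**
     * Decides whether a url at the given distance may be selected for fetch.
     *
     * @param distance    shortest known distance of the url from the injected urls
     * @param maxDistance value of 'generate.max.distance'; assumed: negative means unlimited
     */
    static boolean shouldGenerate(int distance, int maxDistance) {
        if (maxDistance < 0) {
            return true; // assumed meaning of the default -1: no distance limit
        }
        return distance <= maxDistance; // assumed inclusive bound
    }

    public static void main(String[] args) {
        System.out.println(shouldGenerate(5, -1)); // prints true (no limit configured)
        System.out.println(shouldGenerate(2, 2));  // prints true (within the limit)
        System.out.println(shouldGenerate(3, 2));  // prints false (too far from injected urls)
    }
}
```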
> Note that this is fundamentally different from the concept of crawl 'depth'. Depth refers
to the number of crawl cycles. Distance allows crawling for an unlimited number of cycles AND
always staying within a certain number of 'hops' from injected urls.
> I will attach a patch. Will commit in a few days. (It does not change crawl behaviour
unless otherwise configured). Let me know if you have comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
