nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-357) crawling simulation
Date Fri, 06 Feb 2009 13:45:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671133#action_12671133 ]

Andrzej Bialecki  commented on NUTCH-357:
-----------------------------------------

Closing this issue - the suggested solution seems to address the problem in a sufficient way.

> crawling simulation
> -------------------
>
>                 Key: NUTCH-357
>                 URL: https://issues.apache.org/jira/browse/NUTCH-357
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. Reproducing
> these problems is rather difficult: first, it is not polite to re-crawl a set of pages
> again and again; second, it is hard to catch the page that causes a problem.

> Therefore it would be very useful to have a testbed for simulating crawls in which we can
> control the responses of the "web servers".
> To begin with, simulating very basic situations, such as a page that points to itself,
> link chains, or internal links, would already be very useful.
> Later on, however, simulating crawls against existing data collections like TREC or a webgraph
> would be much more interesting, for instance to calculate the quality of the Nutch OPIC implementation
> against the PageRank scores of the webgraph, or to evaluate crawling strategies.
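
The basic situations named in the description (a page that points to itself, a link chain, internal links) can be illustrated with a small stand-alone sketch. This is not the attached patch and does not use any Nutch API: the `SIMULATED_WEB` table, the `sim.example` URLs, and the `fetch` and `crawl` functions are all hypothetical names chosen for illustration, and Python is used only for brevity. The idea is simply that a fake "web server" returns canned responses so a crawl can be repeated without touching the network.

```python
from collections import deque

# Hypothetical simulated "web": each URL maps to the outlinks its page would return.
SIMULATED_WEB = {
    # A page that points to itself.
    "http://sim.example/self": ["http://sim.example/self"],
    # A link chain: chain/0 -> chain/1 -> chain/2.
    "http://sim.example/chain/0": ["http://sim.example/chain/1"],
    "http://sim.example/chain/1": ["http://sim.example/chain/2"],
    "http://sim.example/chain/2": [],
    # A hub with internal links, one of which leads nowhere.
    "http://sim.example/hub": [
        "http://sim.example/self",
        "http://sim.example/chain/0",
        "http://sim.example/missing",
    ],
}


def fetch(url):
    """Stand-in for a real fetch: return (status, outlinks) from the table above."""
    if url in SIMULATED_WEB:
        return 200, list(SIMULATED_WEB[url])
    return 404, []


def crawl(seeds, max_depth):
    """Breadth-first crawl over the simulated web.

    Returns a dict mapping each visited URL to (status, depth).
    URLs deeper than max_depth are not fetched.
    """
    seen = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > max_depth:
            continue
        status, outlinks = fetch(url)
        seen[url] = (status, depth)
        for link in outlinks:
            if link not in seen:
                queue.append((link, depth + 1))
    return seen


if __name__ == "__main__":
    for url, (status, depth) in crawl(["http://sim.example/hub"], 3).items():
        print(depth, status, url)
```

Because every response comes from the table, the same crawl can be re-run as often as needed, and a problematic page shape (for example the self-link) can be isolated by seeding the crawl with that one URL.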

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

