nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Closed: (NUTCH-357) crawling simulation
Date Fri, 06 Feb 2009 13:45:59 GMT


Andrzej Bialecki  closed NUTCH-357.

    Resolution: Won't Fix
      Assignee: Andrzej Bialecki 

> crawling simulation
> -------------------
>                 Key: NUTCH-357
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>         Attachments: protocol-simulation-pluginV1.patch
> We recently discovered some serious issues related to crawling and scoring. Reproducing
> these problems is difficult: first, it is not polite to re-crawl a set of pages again
> and again; second, it is difficult to catch the pages that cause a problem.

> Therefore it would be very useful to have a testbed to simulate crawls where we can
> control the responses of the "web servers".
> To begin with, simulating very basic situations, like a page that points to itself,
> link chains, or internal links, would already be very useful.
> Later on, however, simulating crawls against existing data collections like TREC or a webgraph
> would be much more interesting, for instance to measure the quality of the Nutch OPIC implementation
> against the PageRank scores of the webgraph, or to evaluate crawling strategies.
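
For illustration only, the basic situations named in the description (a page that points to itself, a link chain, internal links) can be sketched as a tiny in-memory testbed. This is not Nutch code and not the attached patch: the `SIMULATED_WEB` table, the `sim.example` URLs, and the `fetch` and `crawl` helpers are all hypothetical names, and the breadth-first loop is a simplified stand-in for a real crawl cycle.

```python
from collections import deque

# Hypothetical simulated "web": each URL maps to the outlinks its page returns.
SIMULATED_WEB = {
    # A page that points to itself.
    "http://sim.example/self": ["http://sim.example/self"],
    # A link chain: 0 -> 1 -> 2.
    "http://sim.example/chain/0": ["http://sim.example/chain/1"],
    "http://sim.example/chain/1": ["http://sim.example/chain/2"],
    "http://sim.example/chain/2": [],
    # Internal links that form a cycle between two pages.
    "http://sim.example/home": ["http://sim.example/about",
                                "http://sim.example/self"],
    "http://sim.example/about": ["http://sim.example/home"],
}


def fetch(url):
    """Return (status, outlinks) as the simulated server would answer."""
    if url not in SIMULATED_WEB:
        return 404, []
    return 200, list(SIMULATED_WEB[url])


def crawl(seeds, max_depth):
    """Breadth-first crawl over the simulated web.

    Each URL is fetched at most once; links found on pages at
    max_depth are not followed.
    """
    fetched = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        status, outlinks = fetch(url)
        fetched.append((url, status))
        if depth >= max_depth:
            continue
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```

Because the responses are fixed in a table, a problematic case such as the self-link can be replayed as often as needed without touching a real server, which is the politeness concern raised above.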

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
