nutch-dev mailing list archives

From "Erol (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-601) Recrawling on existing crawl directory using force option
Date Thu, 28 Feb 2008 19:47:51 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573418#action_12573418 ]

Erol commented on NUTCH-601:
----------------------------

Hello,

I tested this patch, and so far it works. I had a few problems, but Susam helped me :D

I have just one question/request. From what I can see right now, it checks the existing
crawl folder and recrawls all the sites again. Is it possible to filter them, so that it
only crawls the sites we specify?

Otherwise, I think it is a very useful patch.

> Recrawling on existing crawl directory using force option
> ---------------------------------------------------------
>
>                 Key: NUTCH-601
>                 URL: https://issues.apache.org/jira/browse/NUTCH-601
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>            Priority: Minor
>         Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, NUTCH-601v0.3.patch, NUTCH-601v1.0.patch
>
>
> Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one can crawl and recrawl in the following manner:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> This option can be used for the first crawl too:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> If one tries to crawl without the -force option when the crawl directory already exists, the error message includes a short hint to add it:
> {code}
> # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> Exception in thread "main" java.lang.RuntimeException: crawl already exists. Add -force option to recrawl.
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
> {code}
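
The behaviour described in the quoted issue (refuse an existing crawl directory unless -force is given) could be sketched roughly as below. This is only an illustration, not the code from the NUTCH-601 patch: the class name `ForceCheckSketch`, the helper `checkCrawlDir`, the default directory name, and the use of `java.nio.file` (the real Crawl class presumably works through Hadoop's filesystem API) are all assumptions made for the example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only; NOT the code from the NUTCH-601 patch.
final class ForceCheckSketch {

    // Hypothetical helper: reject an existing crawl directory unless force is set.
    static void checkCrawlDir(Path dir, boolean force) {
        if (Files.exists(dir) && !force) {
            throw new RuntimeException(
                dir + " already exists. Add -force option to recrawl.");
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get("crawl"); // assumed default, for this sketch only
        boolean force = false;

        // Minimal argument scan: only -dir and -force are looked at here.
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i]) && i + 1 < args.length) {
                dir = Paths.get(args[++i]);
            } else if ("-force".equals(args[i])) {
                force = true;
            }
        }

        checkCrawlDir(dir, force);
        System.out.println("Proceeding with crawl in " + dir);
    }
}
```

With this shape, a first crawl passes the check whether or not -force is given (the directory does not exist yet), while a second run on the same directory passes only with -force.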

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

