nutch-dev mailing list archives

From "Susam Pal (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option
Date Fri, 15 Feb 2008 20:58:08 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-601:
----------------------------

    Attachment: NUTCH-601v1.0.patch

Attached another patch (NUTCH-601v1.0.patch) that always deletes the old merged index, as
Andrzej suggested.

The v0.4 patch leaves the old merged index in place alongside the new segments if something
goes wrong while the new index is being generated. Whether the index merger fails or succeeds,
there is always an 'index' directory. So, after a recrawl completes, a user has to verify
whether the 'index' directory holds the new merged index or the old one. This may be
confusing.

However, one advantage of the v0.4 patch is that one can run a recrawl on the same crawl
directory that the web GUI is using to serve users, because the patch minimizes the time for
which the index directory is unavailable.

The v1.0 patch always deletes the old indexes as well as the old merged index. Therefore, the
old index never remains once index generation has begun. If the index merger fails, there is
no 'index' directory, which is a clear indication that index generation failed. This prevents
the confusion discussed above.
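
To make the difference between the two patches concrete, here is a rough Java sketch of the
two replacement strategies as described above. It is not the code from either patch: it uses
java.nio.file on a local directory rather than Nutch's own APIs, and the class name, the
Merger interface, the method names and the 'index-new' temporary directory are all made up
for illustration. It also assumes that a failing merger leaves nothing behind in its target
directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IndexSwapSketch {

    /** Hypothetical stand-in for the index merger; throws if merging fails. */
    interface Merger {
        void mergeInto(Path target) throws IOException;
    }

    /** Recursively deletes a directory; does nothing if it does not exist. */
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * v0.4-style behaviour (as described, not the actual patch): build the new
     * index elsewhere and replace the old one only after the merge succeeded.
     */
    static void replaceKeepingOld(Path crawlDir, Merger merger) throws IOException {
        Path index = crawlDir.resolve("index");
        Path tmp = crawlDir.resolve("index-new"); // hypothetical temporary name
        deleteTree(tmp);
        merger.mergeInto(tmp);   // if this throws, the old 'index' is untouched
        deleteTree(index);       // 'index' is missing only between these two calls
        Files.move(tmp, index);
    }

    /**
     * v1.0-style behaviour (as described, not the actual patch): delete the old
     * index first, so a failed merge leaves no 'index' directory at all.
     */
    static void replaceDeletingFirst(Path crawlDir, Merger merger) throws IOException {
        Path index = crawlDir.resolve("index");
        deleteTree(index);
        merger.mergeInto(index); // if this throws, there is no 'index' to mistake for a new one
    }
}
```

With a merger that throws, replaceKeepingOld leaves the previous 'index' in place, while
replaceDeletingFirst leaves no 'index' directory at all, which is the trade-off between the
two patches.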

Please review both patches and accept whichever the community feels is better.

> Recrawling on existing crawl directory using force option
> ---------------------------------------------------------
>
>                 Key: NUTCH-601
>                 URL: https://issues.apache.org/jira/browse/NUTCH-601
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>            Priority: Minor
>         Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, NUTCH-601v0.3.patch, NUTCH-601v1.0.patch
>
>
> Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one
> can crawl and recrawl in the following manner:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> This option can be used for the first crawl too:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> If one tries to crawl without the -force option when the crawl directory already exists,
> one gets an error message that includes a short hint:
> {code}
> # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> Exception in thread "main" java.lang.RuntimeException: crawl already
> exists. Add -force option to recrawl.
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
> {code}
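
The check behind that error message could look roughly like the Java sketch below. This is
not the code from Crawl.java or from the attached patches: the class and method names are
hypothetical, and it checks a local path with java.nio.file instead of going through Nutch's
own file system handling. Only the wording of the message is taken from the output above.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ForceGuardSketch {

    /** Returns true if the (hypothetical) argument list contains -force. */
    static boolean hasForceFlag(String[] args) {
        for (String arg : args) {
            if ("-force".equals(arg)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Refuses to reuse an existing crawl directory unless force is set,
     * using the same message wording as the output shown above.
     */
    static void checkCrawlDir(Path dir, boolean force) {
        if (Files.exists(dir) && !force) {
            throw new RuntimeException(dir + " already exists. Add -force option to recrawl.");
        }
    }
}
```

Because the guard only rejects an existing directory when -force is absent, passing -force
on the very first crawl is harmless, which matches the second pair of commands above.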

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

