nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] Updated: (NUTCH-784) CrawlDBScanner
Date Mon, 01 Feb 2010 14:33:51 GMT


Julien Nioche updated NUTCH-784:

    Attachment: NUTCH-784.patch

> CrawlDBScanner 
> ---------------
>                 Key: NUTCH-784
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-784.patch
> The patch file contains a utility which dumps all the entries whose URL matches a regular
expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs,
as the output can be extremely large, and the -url option does not help if we don't know which
URL we want to look at.
> The CrawlDBScanner can generate either a text representation of the CrawlDatums or binary
objects which can then be used as a new CrawlDB.
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
> -text : if this parameter is used, the output will use TextOutputFormat; otherwise it
generates a 'normal' crawldb with MapFileOutputFormat
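Conceptually, the filtering described above can be sketched in plain Java. The class and method names below are hypothetical, not from the patch, and the crawldb is modeled as a simple url-to-status map; the real tool does this over the crawldb files using the Hadoop output formats mentioned above. An entry is kept when its URL key matches the regex and, if -s was given, its status also matches:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only (names are assumptions, not the actual Nutch code).
public class CrawlDbFilterSketch {

    // crawldb modeled as url -> status; a null status means no -s constraint
    static Map<String, String> scan(Map<String, String> crawldb,
                                    String regex, String status) {
        Pattern p = Pattern.compile(regex);
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : crawldb.entrySet()) {
            boolean urlOk = p.matcher(e.getKey()).matches();   // regex applies to the URL key
            boolean statusOk = (status == null) || status.equals(e.getValue());
            if (urlOk && statusOk) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();
        db.put("http://www.amazon.com/dp/1", "db_fetched");
        db.put("http://www.amazon.com/dp/2", "db_unfetched");
        db.put("http://example.org/", "db_fetched");
        // keep fetched amazon pages only
        System.out.println(scan(db, ".*amazon.*", "db_fetched").keySet());
        // prints [http://www.amazon.com/dp/1]
    }
}
```

Writing the surviving entries with TextOutputFormat gives the dump file; writing them with MapFileOutputFormat gives a smaller crawldb that can be used as input to further Nutch jobs.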
> For instance, the command below:
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump* -s db_fetched
> will generate a text file /tmp/amazon-dump containing all the entries of the crawldb
matching the regexp and having a status of db_fetched.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
