nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help
Date Fri, 17 Apr 2015 18:52:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500430#comment-14500430 ]

Hudson commented on NUTCH-1906:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3065 (See [https://builds.apache.org/job/Nutch-trunk/3065/])
Fix for NUTCH-1906 Typo in CrawlDbReader command line help contributed by Michael Joyce <mltjoyce@gmail.com>.
This closes #20. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1674374)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java


> Typo in CrawlDbReader command line help
> ---------------------------------------
>
>                 Key: NUTCH-1906
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1906
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>             Fix For: 1.10
>
>
> Currently the CrawlDbReader tool, when invoked without any command line arguments, prints the following usage help:
> {code}
> [mdeploy@crawl local]$ ./bin/nutch readdb
> Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
> 	<crawldb>	directory name where crawldb is located
> 	-stats [-sort] 	print overall statistics to System.out
> 		[-sort]	list status sorted by host
> 	-dump <out_dir> [-format normal|csv|crawldb]	dump the whole db to a text file in <out_dir>
> 		[-format csv]	dump in Csv format
> 		[-format normal]	dump in standard format (default option)
> 		[-format crawldb]	dump as CrawlDB
> 		[-regex <expr>]	filter records with expression
> 		[-retry <num>]	minimum retry count
> 		[-status <status>]	filter records by CrawlDatum status
> 	-url <url>	print information on <url> to System.out
> 	-topN <nnnn> <out_dir> [<min>]	dump top <nnnn> urls sorted by score to <out_dir>
> 		[<min>]	skip records with scores below this value.
> 			This can significantly improve performance.
> {code}
> The part that bothers me is:
> {code}
> 	-stats [-sort] 	print overall statistics to System.out
> 		[-sort]	list status sorted by host
> {code}
> Listing -sort twice is unnecessary.
> Having looked through the code, I found no other optional flag that could be substituted for the second one (I had thought it might be a placeholder for something else), so we can simply remove it.
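
For illustration only, here is a minimal sketch of one way the -stats help text could read with the flag documented once. It assumes the bracketed flag stays on the synopsis line and its description moves to a plain continuation line; the wording in the actual commit may differ. The class name UsageSketch and the method statsUsage are hypothetical and are not taken from CrawlDbReader.java.

```java
// Hypothetical sketch, not the code from the NUTCH-1906 commit.
class UsageSketch {

    // Builds only the -stats portion of the help text.
    // "[-sort]" appears once, on the synopsis line; the second line
    // describes the flag without listing it as a separate option again.
    static String statsUsage() {
        StringBuilder sb = new StringBuilder();
        sb.append("\t-stats [-sort]\tprint overall statistics to System.out\n");
        sb.append("\t\t\twith the sort flag, list status sorted by host\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Usage text conventionally goes to standard error.
        System.err.print(statsUsage());
    }
}
```

Keeping the description on a continuation line preserves the explanation of what the flag does while avoiding the impression that a second, separate option exists.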



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
