nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Closed: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format
Date Tue, 15 Jan 2008 22:05:39 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-528.
-----------------------------------


> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch, NUTCH-528_v2.patch, NUTCH-528_v3.patch
>
>
> * I've improved the stats to list the number of URLs by status and by host. This option is not mandatory.
> For instance, if you set the sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:	36
> retry 0:	36
> min score:	0.0020
> avg score:	0.059
> max score:	1.0
> status 1 (db_unfetched):	33
>    www.yahoo.com :	33
> status 2 (db_fetched):	3
>    www.yahoo.com :	3
> CrawlDb statistics: done
> Of course, without this option the stats are unchanged.
> * I've added a new option to dump the crawldb in CSV format. It will then be easy to import the file into Excel and compute more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.
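
For readers who would rather script the analysis than use Excel, here is a minimal Python sketch (not part of the patch) that reads a dump in the semicolon-separated layout shown above and recomputes the per-status, per-host counts that the sortByHost stats print. The function name and the inline sample are illustrative assumptions. Note that the sample rows in the extract carry one more field than the header names, so the sketch relies only on the first three columns (URL, status code, status name).

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit

# Rows copied from the extract in the issue description (wrapped lines rejoined).
SAMPLE = '''Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
"http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
"http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
'''


def count_by_status_and_host(lines):
    """Count URLs per (status code, status name, host).

    `lines` is any iterable of text lines, e.g. an open file.
    Hypothetical helper: the name is not taken from Nutch.
    """
    reader = csv.reader(lines, delimiter=";", quotechar='"')
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        url, status_code, status_name = row[0], row[1], row[2]
        host = urlsplit(url).hostname or ""
        counts[(int(status_code), status_name, host)] += 1
    return counts


counts = count_by_status_and_host(io.StringIO(SAMPLE))
for (code, name, host), n in sorted(counts.items()):
    print(f"status {code} ({name}) {host}: {n}")
```

To run this against a real dump, pass an open file instead of the inline sample. The name of the file written inside FOLDER is not stated in this message, so the path has to be filled in by hand.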

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

