nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format
Date Thu, 27 Dec 2007 11:51:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554560 ]

Andrzej Bialecki  commented on NUTCH-528:
-----------------------------------------

Thanks for the gentle reminder :) After reviewing patch v2 I have several comments:

* CrawlDatum.getMetaData().toString() can easily break the CSV format: any key or value
that contains a literal double quote, semicolon, or line break will corrupt the output.
Either ignore the metadata, or pass this string through a method that escapes the special
characters that could break the format.
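A minimal sketch of such an escaping helper (the class and method names here are hypothetical, not part of the patch; it assumes the semicolon-delimited, double-quoted dialect shown in the dump sample below):

```java
// Hypothetical helper for the CSV dump; not the patch's actual code.
public class CsvUtil {

  /**
   * Quote a field and double any embedded quotes, the usual CSV quoting rule.
   * Semicolons and line breaks are then safely contained inside the quotes.
   */
  public static String escapeCsvField(String value) {
    if (value == null) return "\"null\""; // matches the "null" placeholders in the dump
    return "\"" + value.replace("\"", "\"\"") + "\"";
  }
}
```

With this in place, the metadata string can be passed through escapeCsvField() instead of being written verbatim.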

* CrawlDbReader.stats.sort: this property name doesn't follow the de facto convention that
we try to keep when adding new property names. I suggest db.reader.stats.sort, and it should
be added to the appropriate section of nutch-default.xml.
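For reference, an entry following that convention could look like this in nutch-default.xml (the default value and description are suggestions, not settled):

```xml
<property>
  <name>db.reader.stats.sort</name>
  <value>false</value>
  <description>If true, CrawlDbReader -stats also breaks down the
  per-status URL counts by host.</description>
</property>
```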

* I think that processDumpJob should not accept a String format and parse it internally.
In my opinion parsing should be the caller's responsibility, and the argument here should be
an int constant.
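Something along these lines (the constant names and parse helper are hypothetical, not the actual Nutch API proposed in the patch):

```java
// Sketch only: illustrates moving format parsing to the caller side.
public class DumpFormat {
  public static final int FORMAT_PLAIN = 0; // default text dump
  public static final int FORMAT_CSV = 1;   // the new CSV dump

  /**
   * Caller-side parsing: map the command-line token to an int constant
   * before invoking processDumpJob(output, format).
   */
  public static int parseFormat(String arg) {
    return "toCsv".equals(arg) ? FORMAT_CSV : FORMAT_PLAIN;
  }
}
```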

* the section that parses input arguments should warn about bad arguments, but this patch
removes that warning.

* a minor issue: the patch uses inconsistent whitespace (e.g. {{if(sort){}}, {{if(st.length
>2 )}}, or {{ format = args[i=i+2];}}); this should be fixed so that it follows the coding
convention.

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch, NUTCH-528_v2.patch
>
>
> * I've improved the stats to list the number of URLs by status and by host. This option
is not mandatory.
> For instance, if you set the sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:	36
> retry 0:	36
> min score:	0.0020
> avg score:	0.059
> max score:	1.0
> status 1 (db_unfetched):	33
>    www.yahoo.com :	33
> status 2 (db_fetched):	3
>    www.yahoo.com :	3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
> * I've added a new option to dump the crawldb into a CSV format. It is then easy to
import the file into Excel and compute more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00
CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan
01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu
Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

