nutch-dev mailing list archives

From "Emmanuel Joke (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format
Date Fri, 28 Dec 2007 02:59:43 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Joke updated NUTCH-528:
--------------------------------

    Attachment: NUTCH-528_v3.patch

New patch provided, following Andrzej's recommendations:

??* CrawlDatum.getMetaData().toString() can easily break the CSV format, it's enough
that some of the keys or values contain literal double quotes or semicolons, not to
mention line breaks. Either you ignore the metadata, or you need to pass this string
through a method that will escape special characters that could break the format.??
==> I've removed the MetaData. I don't think it's really important in this format.
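
For reference, a minimal sketch of the escaping approach described above, in case we want to bring the metadata back later; escapeCsvField is a hypothetical helper and is not part of the attached patch:
{code}
// Hypothetical helper, not in NUTCH-528_v3.patch: quote the field and double
// any embedded quotes so that literal quotes and semicolons cannot break the
// semicolon-separated record (embedded line breaks would still need to be
// handled by a CSV-aware reader, or stripped here).
public static String escapeCsvField(String value) {
  if (value == null) {
    return "\"null\"";
  }
  return "\"" + value.replace("\"", "\"\"") + "\"";
}
{code}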

??* CrawlDbReader.stats.sort: this property name doesn't follow the de facto
convention that we try to keep when adding new property names. I suggest
db.reader.stats.sort, and it should be added in the appropriate section of
nutch-default.xml??
==> I've also changed the property CrawlDbReader.topN so it is consistent with the convention.
We don't need to add them to the config file; they are just internal settings that are set from
the args parameter in the main method.
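
Roughly what that looks like in main(); the property names follow the convention discussed above (the renamed topN property name is my guess along the same lines), and the surrounding code is only a sketch, not the patch itself:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ReaderConfigSketch {
  // Sketch only: the renamed properties are set from the parsed command-line
  // arguments instead of being declared in nutch-default.xml. Flag and
  // variable names here are illustrative.
  static Configuration configure(boolean sortByHost, long topN) {
    Configuration conf = NutchConfiguration.create();
    if (sortByHost) {
      conf.setBoolean("db.reader.stats.sort", true);  // was CrawlDbReader.stats.sort
    }
    if (topN > 0) {
      conf.setLong("db.reader.topn", topN);           // renamed from CrawlDbReader.topN
    }
    return conf;
  }
}
{code}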

??* I think that processDumpJob should not accept a String format, and parse it
internally. In my opinion this should be the caller's responsibility, and the
argument here should be an int constant.??
==> You're right. It's now done.
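
In other words, something along these lines; the constant and argument names are illustrative, not necessarily the ones used in the patch:
{code}
// Illustrative sketch of the change: the textual format is parsed by the
// caller, and processDumpJob only receives an int constant.
public class DumpFormatSketch {
  public static final int STD_FORMAT = 0;
  public static final int CSV_FORMAT = 1;

  // caller-side parsing, e.g. of the "toCsv" command-line argument
  static int parseFormat(String formatArg) {
    return "toCsv".equals(formatArg) ? CSV_FORMAT : STD_FORMAT;
  }
}
{code}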

I took advantage of this new patch to make some small modifications to all classes implementing the
Hadoop Mapper/Reducer interfaces, in order to remove minor errors reported by Eclipse.
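
For the record, the kind of cosmetic change involved: declaring the generic key/value types on the old mapred interfaces so that Eclipse stops flagging raw types. The class below is only an example of the pattern, not code taken from the patch:
{code}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Example only: generic type parameters declared instead of raw types.
public class ExampleMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {
  public void map(Text key, CrawlDatum value,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    output.collect(key, value);   // identity map, for illustration only
  }
}
{code}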

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch, NUTCH-528_v2.patch, NUTCH-528_v3.patch
>
>
> * I've improved the stats to list the number of urls by status and by host. This is an optional feature, not mandatory.
> For instance if you set sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:	36
> retry 0:	36
> min score:	0.0020
> avg score:	0.059
> max score:	1.0
> status 1 (db_unfetched):	33
>    www.yahoo.com :	33
> status 2 (db_fetched):	3
>    www.yahoo.com :	3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
> * I've added a new option to dump the crawldb into a CSV format. It is then easy to load the file into Excel and compute some more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00
CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan
01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu
Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code ( CrawlDbDumpReducer ) as confirmed by Andrzej.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

