nutch-dev mailing list archives

From ogjunk-nu...@yahoo.com
Subject Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date Fri, 18 Apr 2008 22:05:52 GMT
You are both in agreement, but I don't fully follow as I'm not intimately familiar with all
the files and structures yet.

- Having the Fetchers put info about hosts into crawl_fetch for each fetched segment makes sense.
I see Fetcher(2) uses FetcherOutputFormat, which has its own RecordWriter, which then writes
CrawlDatum to HDFS.  I do not see where exactly to plug per-host info writing into the current
Fetcher2 flow.  I think the thing to do would be to simply collect the data in memory and,
at the end of the fetch run (at the end of the fetch(....) method), write it out.  I just don't
know how to write it out to HDFS without relying on Reduce to do the writing for me.

Should it be something as simple as the following?

        // write a plain-text file with space-delimited values
        FileSystem hdfs = FileSystem.get(getConf());
        FSDataOutputStream dos = hdfs.create(path);
        dos.writeUTF(host + " " + requests + " " + timeouts..... );
        dos.close();
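
One caveat with the snippet above: DataOutputStream.writeUTF prefixes the string with a
two-byte length and uses modified UTF-8, so the result would not be quite a plain-text file.
A minimal sketch of the collect-in-memory-then-write idea follows, assuming only the JDK.
The class HostStats and its methods are hypothetical names for illustration, not part of
Nutch.  Since the FSDataOutputStream returned by hdfs.create(path) is a java.io.OutputStream,
it could presumably be passed to writeTo directly.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory collector of per-host fetch counters (not a Nutch class). */
class HostStats {
    // counts[0] = requests, counts[1] = timeouts; TreeMap keeps hosts sorted.
    // Not thread-safe: a multi-threaded fetcher would need synchronization.
    private final Map<String, long[]> counts = new TreeMap<>();

    void recordRequest(String host) {
        counts.computeIfAbsent(host, h -> new long[2])[0]++;
    }

    void recordTimeout(String host) {
        counts.computeIfAbsent(host, h -> new long[2])[1]++;
    }

    /** Writes one "host requests timeouts" line per host as UTF-8 text. */
    void writeTo(OutputStream out) throws IOException {
        Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            long[] c = e.getValue();
            w.write(e.getKey() + " " + c[0] + " " + c[1] + "\n");
        }
        w.flush(); // the caller owns the stream and is responsible for closing it
    }
}
```

Writing through a Writer with an explicit charset keeps the file readable with ordinary text
tools, which seems to be the intent of the "plain-text file" comment.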

- I don't understand how per-host info can go in the CrawlDb.  Isn't CrawlDb a database of
all known *URLs*?  Doesn't CrawlDb contain only CrawlDatum records, and doesn't each CrawlDatum
hold data about a single URL?  So if I wanted to record, say, the number of timeouts for a
given host, how would I add that to a CrawlDatum, when a CrawlDatum is for a specific URL,
and not host?

I do understand that CrawlDb is the source to get all known URLs from,
and from those URLs we can extract host names, domains, etc. (which is what the
DomainStatistics tool does), but I don't understand how you'd use CrawlDb as the source of
per-host data, since CrawlDb does not have aggregate per-host data.  Shouldn't that live in
a separate file, a file that can be updated after every fetch run?


Thanks,
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Doğacan Güney (JIRA) <jira@apache.org>
To: nutch-dev@lucene.apache.org
Sent: Friday, April 18, 2008 2:40:21 PM
Subject: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information


    [ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590559#action_12590559
] 

Doğacan Güney commented on NUTCH-628:
-------------------------------------

+1 for extracting hostdb from crawldb...

(also, do we really want to make hostdb just a map file of <Text,MapWritable>? IMHO,
it would be better to design a proper HostDatum class with some statistics built-in, and then
maybe a Metadata element [I guess it's just me but I hate MapWritable, I prefer Metadata:D])

> Host database to keep track of host-level information
> -----------------------------------------------------
>
>                 Key: NUTCH-628
>                 URL: https://issues.apache.org/jira/browse/NUTCH-628
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Otis Gospodnetic
>         Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  For instance,
> Nutch could detect hosts that are timing out, store information about that in this DB.  Segment/fetchlist
> Generator could then skip such hosts, so they don't slow down the fetch job.  Another good
> use for such a DB is keeping track of various host scores, e.g. spam score.
> From the recent thread on nutch-user@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as its structures go?
> Andrzej said:
> The easiest I can imagine is to use something like <Text, MapWritable>.
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
