nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date Thu, 17 Apr 2008 08:11:21 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589861#action_12589861 ]

Andrzej Bialecki  commented on NUTCH-628:
-----------------------------------------

IMHO a better option would be to put this data into CrawlDb, and then maintain HostDb data
using CrawlDb as the source. The reason is that segments may contain duplicate URLs, or may
be missing, unparsed, etc. In short, they are transient and not unique, whereas CrawlDb is
a persistent store of our knowledge about all known URLs and contains only unique URLs.

So, I think that Fetchers should put this information in crawl_fetch, updatedb should
carry it into the CrawlDatum entries in CrawlDb (this should happen automatically), and
HostDb would simply aggregate this info from CrawlDb, using hostname / domain name / TLD
as the keys.
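
A rough sketch of that aggregation step, in plain Python rather than as a real Nutch/Hadoop
job. Everything here is an assumption made for illustration: the dict standing in for
CrawlDb, the "timed_out" field, and the function names are hypothetical and are not part of
Nutch or of the attached patches. The domain extraction is also deliberately naive (last two
labels of the hostname).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_keys(url):
    """Return (level, name) keys for the host, domain and TLD of a URL.

    Naive assumption: the domain is the last two labels of the hostname,
    which is wrong for suffixes such as co.uk.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return [
        ("host", host),
        ("domain", ".".join(labels[-2:])),
        ("tld", labels[-1]),
    ]


def aggregate(crawl_db):
    """Aggregate per-URL records into per-host/domain/TLD counters.

    crawl_db is a hypothetical stand-in for CrawlDb: a dict mapping each
    unique URL to a dict with a boolean "timed_out" field.
    """
    stats = defaultdict(lambda: {"urls": 0, "timeouts": 0})
    for url, datum in crawl_db.items():
        for key in host_keys(url):
            stats[key]["urls"] += 1
            if datum.get("timed_out"):
                stats[key]["timeouts"] += 1
    return dict(stats)


if __name__ == "__main__":
    crawl_db = {
        "http://a.example.com/1": {"timed_out": True},
        "http://a.example.com/2": {"timed_out": False},
        "http://b.example.com/": {"timed_out": True},
    }
    for key, value in sorted(aggregate(crawl_db).items()):
        print(key, value)
```

Because the input contains each URL only once, the counters are not inflated by duplicates,
which is the point of aggregating from CrawlDb rather than from segments.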

> Host database to keep track of host-level information
> -----------------------------------------------------
>
>                 Key: NUTCH-628
>                 URL: https://issues.apache.org/jira/browse/NUTCH-628
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Otis Gospodnetic
>         Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  For instance,
> Nutch could detect hosts that are timing out and store information about that in this DB.
> The segment/fetchlist Generator could then skip such hosts, so they don't slow down the
> fetch job.  Another good use for such a DB is keeping track of various host scores, e.g.
> spam score.
> From the recent thread on nutch-user@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as its
> > structures go?
> Andrzej said:
> The easiest I can imagine is to use something like <Text, MapWritable>.
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
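
The <Text, MapWritable> idea quoted above could look roughly like the following, using a
plain Python dict of dicts as a stand-in for the Hadoop types. This is only a sketch under
assumptions: the key prefixes ("host:", "tld:"), the metric names, and the helper functions
are hypothetical and do not come from Nutch or from the attached patches.

```python
def set_gauge(db, key, metric, value):
    """Store the latest value of a metric (a gauge) under an arbitrary key."""
    db.setdefault(key, {})[metric] = value


def record(db, key, metric, value):
    """Add a sample to a metric, keeping a running total and sample count."""
    entry = db.setdefault(key, {})
    entry[metric + ".total"] = entry.get(metric + ".total", 0.0) + value
    entry[metric + ".count"] = entry.get(metric + ".count", 0) + 1


def average(db, key, metric):
    """Derive the average from the stored total and count, or None if empty."""
    entry = db.get(key, {})
    count = entry.get(metric + ".count", 0)
    if count == 0:
        return None
    return entry[metric + ".total"] / count


if __name__ == "__main__":
    db = {}  # one store holding entries at several levels
    record(db, "host:a.example.com", "fetch_ms", 100)
    record(db, "host:a.example.com", "fetch_ms", 300)
    record(db, "tld:com", "fetch_ms", 100)
    set_gauge(db, "host:a.example.com", "spam_score", 0.8)
    print(average(db, "host:a.example.com", "fetch_ms"))
    print(db["host:a.example.com"]["spam_score"])
```

Storing totals and counts rather than averages keeps entries mergeable: two partial entries
for the same key can be combined by adding their totals and counts.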

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

