nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (NUTCH-628) Host database to keep track of host-level information
Date Sat, 02 Jul 2011 07:31:29 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058993#comment-13058993 ]

Lewis John McGibbney edited comment on NUTCH-628 at 7/2/11 7:30 AM:
--------------------------------------------------------------------

From the previous discussion on this ticket, I think there is evidence that this class is
useful. The problem is that the issue is still open and that there is no entry for it in
the current Nutch 1.3 /bin/nutch script. Is it worthwhile providing a patch for this?


      was (Author: lewismc):
    From previous discussion on this ticket I think there is evidence that this class has
some useful credentials. The problem is that the issue is still open and that there is no
entry for this in current /bin/nutch script. Is it worth while providing a patch for this?

  
> Host database to keep track of host-level information
> -----------------------------------------------------
>
>                 Key: NUTCH-628
>                 URL: https://issues.apache.org/jira/browse/NUTCH-628
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Otis Gospodnetic
>         Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch, domain_statistics_v2.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  For instance,
> Nutch could detect hosts that are timing out and store information about that in this DB.
> The segment/fetchlist Generator could then skip such hosts, so they don't slow down the
> fetch job.  Another good use for such a DB is keeping track of various host scores, e.g.
> spam score.
> From the recent thread on nutch-user@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as its structures go?
> Andrzej said:
> The easiest I can imagine is to use something like <Text, MapWritable>.
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
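
The structure Andrzej describes could be sketched roughly as follows. This is a minimal illustration in plain Java, not code from any of the attached patches: `String` and `Map<String, Long>` stand in for Hadoop's `Text` and `MapWritable`, and the class name `HostDbSketch`, its methods, and the `"timeouts"` key are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: String and Map<String, Long> are stand-ins for Hadoop's
// Text and MapWritable. Every name in this class is hypothetical.
class HostDbSketch {
    // Aggregation key (a host, domain, or TLD) -> named statistics.
    private final Map<String, Map<String, Long>> db = new HashMap<>();

    // Add delta to a named counter stored under the given key.
    public void increment(String key, String stat, long delta) {
        db.computeIfAbsent(key, k -> new HashMap<>()).merge(stat, delta, Long::sum);
    }

    // Read a counter; unknown keys and unknown statistics read as 0.
    public long get(String key, String stat) {
        Map<String, Long> stats = db.get(key);
        return stats == null ? 0L : stats.getOrDefault(stat, 0L);
    }

    // Assumed generator-side check: skip a host once its recorded
    // timeouts exceed the given threshold.
    public boolean shouldSkip(String host, long maxTimeouts) {
        return get(host, "timeouts") > maxTimeouts;
    }

    public static void main(String[] args) {
        HostDbSketch hostDb = new HostDbSketch();
        hostDb.increment("slow.example.com", "timeouts", 4);
        hostDb.increment("example.com", "fetched", 120); // domain-level total
        System.out.println(hostDb.shouldSkip("slow.example.com", 3));
        System.out.println(hostDb.shouldSkip("fast.example.com", 3));
    }
}
```

Because the keys are arbitrary strings, the same map can hold host-level, domain-level, and TLD-level entries side by side, which is the flexibility the suggestion above is after.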

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
