nutch-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (NUTCH-628) Host database to keep track of host-level information
Date Thu, 17 Apr 2008 05:43:21 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589833#action_12589833
] 

otis edited comment on NUTCH-628 at 4/16/08 10:42 PM:
------------------------------------------------------------------

HostDatum.java
  - really just holds a MapWritable

HostDb.java
  - can read an existing HostDb (MapReduce job)
  - can merge host info from segments into the main HostDb (MapReduce job)
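
The merge step described above can be pictured with a small plain-Java sketch. This is
not code from the patch: HostStatsMerger and the counter names are hypothetical, and
java.util maps stand in for the MapWritable that HostDatum holds. It only shows the
reduce-side idea of summing same-named counters for one host.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the reduce side of a HostDb merge job.
// HostStatsMerger and the counter names are hypothetical; a real job
// would operate on Hadoop Writables (e.g. MapWritable), not java.util maps.
final class HostStatsMerger {

    // Sums counters that share a name across all records for one host.
    static Map<String, Long> merge(List<Map<String, Long>> recordsForHost) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> record : recordsForHost) {
            for (Map.Entry<String, Long> e : record.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }
}
```

Summing is only one possible merge policy; gauges or averages would need different
handling than running totals.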


The above classes are in the patch.  Their descriptions reflect the plan and where the
patch is headed.  I have not run or tested this code yet, but I would *very* much appreciate
it if others could look at the approach and comment on it, particularly the 2 inner Mapper
and 2 inner Reducer classes.

As for where the host data will come from, I intend to modify Fetcher2 to dump host stats
(number of requests, successes, failures, exceptions, timeouts, etc.) to, say, a fetch_hosts
file in the current segment.  At this point I don't know what the best file format would be
for that, so please show me the way.
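
To make the question concrete, here is a minimal sketch of per-host counters rendered as
one tab-separated text line per host.  Everything in it is an assumption for illustration:
HostFetchStats, the counter names, and the line layout are hypothetical, and a plain-text
line is just one candidate format, not a decision.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical per-host counters collected while fetching; not part of the patch.
final class HostFetchStats {
    private final Map<String, Long> counters = new LinkedHashMap<>();

    // Increments the named counter (e.g. "requests", "timeouts") by one.
    void increment(String counter) {
        counters.merge(counter, 1L, Long::sum);
    }

    // Returns the current value, or 0 if the counter was never incremented.
    long get(String counter) {
        return counters.getOrDefault(counter, 0L);
    }

    // Renders "host<TAB>name=value<TAB>name=value..." in insertion order.
    String toLine(String host) {
        StringJoiner line = new StringJoiner("\t");
        line.add(host);
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            line.add(e.getKey() + "=" + e.getValue());
        }
        return line.toString();
    }
}
```

A text line like this is easy to inspect by hand, while a binary key/value file would fit
more naturally into a later MapReduce merge; which trade-off is right is exactly the open
question.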

> Host database to keep track of host-level information
> -----------------------------------------------------
>
>                 Key: NUTCH-628
>                 URL: https://issues.apache.org/jira/browse/NUTCH-628
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Otis Gospodnetic
>         Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  For instance,
> Nutch could detect hosts that are timing out and store information about that in this DB.
> The segment/fetchlist Generator could then skip such hosts, so they don't slow down the
> fetch job.  Another good use for such a DB is keeping track of various host scores, e.g.
> spam score.
> From the recent thread on nutch-user@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as its structures go?
> Andrzej said:
> The easiest I can imagine is to use something like <Text, MapWritable>.
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
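
One way to picture the "different levels" idea quoted above is to derive several keys from
a single host name and record statistics under each.  The sketch below is an assumption,
not something from the thread or the patch: HostKeys and the "host:", "domain:", "tld:"
prefixes are hypothetical, and the domain is naively taken as the last two labels, which
is wrong for suffixes such as co.uk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: derives aggregation keys at host, domain and TLD level.
final class HostKeys {

    // Returns keys from most to least specific. The domain is naively the
    // last two labels, so multi-label public suffixes are not handled.
    static List<String> levels(String host) {
        String normalized = host.toLowerCase(Locale.ROOT);
        String[] labels = normalized.split("\\.");
        List<String> keys = new ArrayList<>();
        keys.add("host:" + normalized);
        if (labels.length >= 2) {
            keys.add("domain:" + labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        keys.add("tld:" + labels[labels.length - 1]);
        return keys;
    }
}
```

With keys like these, a single <Text, MapWritable>-style store could hold host, domain and
TLD aggregates side by side, as the quoted suggestion describes.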

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

