nutch-dev mailing list archives

From "Semyon Semyonov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
Date Wed, 17 Jan 2018 13:32:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semyon Semyonov updated NUTCH-2481:
-----------------------------------
    Component/s: generator

> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
>                 Key: NUTCH-2481
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2481
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator, hostdb
>            Reporter: Semyon Semyonov
>            Priority: Major
>
> Allow previous-step statistics (deltas of fetched, unfetched, etc.) to be used in hostdb. The motivation is to use these statistics in generate with maxCount expressions.
> See the example below and two possible solutions.
> ??Let's say that for each website the condition is: generate while the number of fetched pages < 150.
> The problem is that for some websites this limit will (almost) never be reached, because of the site's structure.
> 1) Round 1. 1 page
> 2) Round 2. 10 pages
> 3) Round 3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page 
> ...etc.
> I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
> In this case the process should stop at round 5, with the total number of fetched pages equal to 92 (1 + 10 + 80 + 1).
> ??
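
The stopping rule in the example can be sketched as follows. This is a minimal illustration, not Nutch code: the class and method names are hypothetical, and it assumes that delta_fetched is undefined until two snapshots exist, so the delta check is skipped after the first round. Without that assumption, the rule would already stop after round 1, where the delta is 1.

```java
// Hypothetical sketch of the proposed rule; none of these names exist in Nutch.
final class DeltaStopRule {

    /**
     * Returns true while another generate round should be scheduled for a host.
     * previousFetched is null when no previous-step statistics exist yet
     * (assumption: the delta check is skipped in that case).
     */
    static boolean shouldGenerate(int fetched, Integer previousFetched,
                                  int maxFetched, int minDelta) {
        if (fetched >= maxFetched) {
            return false; // the original condition: fetched < maxFetched
        }
        if (previousFetched == null) {
            return true; // delta undefined after the first round
        }
        return fetched - previousFetched > minDelta; // delta_fetched > minDelta
    }

    /** Replays per-round fetch counts and returns the total fetched when the rule stops. */
    static int totalFetchedAtStop(int[] fetchedPerRound, int maxFetched, int minDelta) {
        int fetched = 0;
        Integer previousFetched = null;
        boolean firstRound = true;
        for (int pagesThisRound : fetchedPerRound) {
            // The first round always runs; later rounds are gated by the rule.
            if (!firstRound && !shouldGenerate(fetched, previousFetched, maxFetched, minDelta)) {
                break;
            }
            previousFetched = firstRound ? null : fetched;
            fetched += pagesThisRound;
            firstRound = false;
        }
        return fetched;
    }
}
```

Replaying the rounds from the example, `DeltaStopRule.totalFetchedAtStop(new int[] {1, 10, 80, 1, 1}, 150, 1)` returns 92: after round 4 the delta is 1, so round 5 is not generated.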
> I see two possible solutions:
> 1. In the HostDatum class, store the last step's statistics alongside the current statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected int gone = 0;
> }
> Inside HostDatum
> private PagesStatistics currentStatistics;
> private PagesStatistics previousStepStatistics;
> And update both in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored in a Dictionary (RAM).*
> 2. Include metadata flag(s) in HostDatum, stored as a field (Metadata.StopGenerate = true/false). Calculate the value of StopGenerate in UpdateHostDb.
> *The main advantage is space: only a flag is stored in the db. The main problem is the lack of flexibility in Generate.*
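
To make the trade-off between the two options concrete, here is a rough sketch that holds both in one class for comparison. It is not the actual HostDatum: `HostDatumSketch`, `update`, and `deltaFetched` are hypothetical names, the thresholds are passed in as plain parameters, and a real change would pick one option rather than both.

```java
// Hypothetical sketch; not the real org.apache.nutch.hostdb.HostDatum.
final class PagesStatistics {
    int unfetched = 0;
    int fetched = 0;
    int notModified = 0;
    int redirTemp = 0;
    int redirPerm = 0;
    int gone = 0;
}

final class HostDatumSketch {
    // Option 1: keep two full snapshots (six extra ints per host).
    PagesStatistics currentStatistics;      // null before the first update
    PagesStatistics previousStepStatistics; // null until the second update

    // Option 2: keep a single precomputed flag per host.
    boolean stopGenerate = false;

    /** Called once per update step with the freshly counted statistics. */
    void update(PagesStatistics fresh, int maxFetched, int minDelta) {
        previousStepStatistics = currentStatistics;
        currentStatistics = fresh;
        // Option 2 fixes the rule at update time; generate only reads the flag.
        stopGenerate = fresh.fetched >= maxFetched
                || (previousStepStatistics != null && deltaFetched() <= minDelta);
    }

    /** Option 1 lets generate compute any delta it needs at generate time. */
    int deltaFetched() {
        return currentStatistics.fetched - previousStepStatistics.fetched;
    }
}
```

With option 1, generate can evaluate arbitrary expressions over both snapshots at the cost of the extra fields held in memory. With option 2, only the boolean is stored, but the rule is baked in when the host db is updated.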



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
