nutch-dev mailing list archives

From "Semyon Semyonov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
Date Wed, 17 Jan 2018 13:36:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semyon Semyonov updated NUTCH-2481:
-----------------------------------
    Description: 
To allow the use of previous-step statistics (deltas of fetched, unfetched, etc.) in hostdb.
The motivation is to use these statistics in generate with maxCount expressions.

 

The 

  was:
To allow the use of previous-step statistics (deltas of fetched, unfetched, etc.) in hostdb.
The motivation is to use these statistics in generate with maxCount expressions.

See an example below and two possible solutions.

??Let's say that for each website the generate condition is: continue while the number of fetched pages < 150.

The problem is that for some websites this limit will (almost) never be reached, because of
the site's structure.
1) Round 1: 1 page
2) Round 2: 10 pages
3) Round 3: 80 pages
4) Round 4: 1 page
5) Round 5: 1 page
...etc.

I would like to add a delta condition on fetched that describes the speed of the process. Let's
say: generate while number of fetched < 150 && delta_fetched > 1.
Therefore, in this case the process should stop at round 5, with the total number of fetched pages
equal to 92.
??
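
The stopping rule in the example can be replayed with a small, self-contained Java sketch. It is only an illustration: `DeltaStopSimulation`, `simulate`, and the `warmupRounds` parameter are hypothetical names, not part of Nutch. One assumption is needed to reproduce the stated outcome: round 1 itself fetches a single page, so a literal `delta_fetched > 1` check would already fail after round 1. The sketch therefore skips the delta check for the first `warmupRounds` completed rounds.

```java
import java.util.List;

// Hypothetical helper, not part of Nutch: replays per-round fetch counts and
// reports when "fetched < maxFetched && delta_fetched > minDelta" first fails.
final class DeltaStopSimulation {

    /** Outcome of a replay: the round that is not generated, and the total fetched so far. */
    static final class Result {
        final int stopRound;     // 1-based; -1 if the condition never failed
        final int totalFetched;

        Result(int stopRound, int totalFetched) {
            this.stopRound = stopRound;
            this.totalFetched = totalFetched;
        }
    }

    static Result simulate(List<Integer> pagesPerRound, int maxFetched,
                           int minDelta, int warmupRounds) {
        int total = 0;
        for (int i = 0; i < pagesPerRound.size(); i++) {
            // i rounds have completed; decide whether to generate round i + 1.
            boolean underLimit = total < maxFetched;
            // Assumption: the delta check only applies once more than
            // warmupRounds rounds have completed (there is no delta before round 1).
            boolean deltaOk = i == 0 || i <= warmupRounds
                    || pagesPerRound.get(i - 1) > minDelta;
            if (!(underLimit && deltaOk)) {
                return new Result(i + 1, total);
            }
            total += pagesPerRound.get(i);
        }
        return new Result(-1, total);
    }
}
```

With the rounds from the example (1, 10, 80, 1, 1), `maxFetched = 150`, `minDelta = 1` and `warmupRounds = 1`, the replay stops at round 5 with 92 pages fetched. With `warmupRounds = 0` it stops at round 2 with 1 page fetched.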

I see two possible solutions:
1. In the HostDatum class, apart from the current statistics, include the last step's statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

Both would be updated in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored
in a dictionary (in RAM).*
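
A minimal sketch of how option 1 could expose deltas, assuming the update step hands over a fresh set of counters each round. `HostDatumSketch`, `update`, and `deltaFetched` are hypothetical names for illustration; this is not the actual Nutch `HostDatum` class.

```java
// Hypothetical sketch of option 1; field names follow the proposal above.
final class PagesStatistics {
    int unfetched = 0;
    int fetched = 0;
    int notModified = 0;
    int redirTemp = 0;
    int redirPerm = 0;
    int gone = 0;
}

final class HostDatumSketch {
    private PagesStatistics currentStatistics = new PagesStatistics();
    private PagesStatistics previousStepStatistics = new PagesStatistics();

    /** Called once per update step: the old current statistics become the previous ones. */
    void update(PagesStatistics fresh) {
        previousStepStatistics = currentStatistics;
        currentStatistics = fresh;
    }

    /** Pages fetched during the last step (current total minus previous total). */
    int deltaFetched() {
        return currentStatistics.fetched - previousStepStatistics.fetched;
    }

    /** Change in the number of unfetched pages during the last step; may be negative. */
    int deltaUnfetched() {
        return currentStatistics.unfetched - previousStepStatistics.unfetched;
    }
}
```

Keeping a second set of six `int` counters adds 24 bytes of raw field data per host, before any object overhead, which is the space cost noted above.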

2. Include metadata flag(s) in HostDatum, stored as a field in HostDatum (Metadata.StopGenerate
= true/false). Calculate the value of StopGenerate in UpdateHostDb.
*The main advantage is space: only a flag is stored in the db. The main problem is the lack of
flexibility in Generate.*
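
Option 2 could look roughly like the following sketch, assuming the update step can see both the previous and the current fetched totals for a host. The class, the method names, and the plain `Map<String, String>` standing in for the HostDatum metadata are all hypothetical; only the `StopGenerate` key comes from the proposal.

```java
import java.util.Map;

// Hypothetical sketch of option 2: the decision is made at update time and
// only a boolean flag is persisted; generate just reads the flag.
final class StopGenerateFlagSketch {
    static final String STOP_GENERATE = "StopGenerate";

    /** True when the host reached the limit or the last step fetched too few pages. */
    static boolean shouldStop(int previousFetched, int currentFetched,
                              int maxFetched, int minDelta) {
        int delta = currentFetched - previousFetched;
        return currentFetched >= maxFetched || delta <= minDelta;
    }

    /** Update step: compute the flag and store it in the host's metadata. */
    static void updateMetadata(Map<String, String> metadata, int previousFetched,
                               int currentFetched, int maxFetched, int minDelta) {
        boolean stop = shouldStop(previousFetched, currentFetched, maxFetched, minDelta);
        metadata.put(STOP_GENERATE, Boolean.toString(stop));
    }

    /** Generate step: a missing flag is treated as "keep generating". */
    static boolean mayGenerate(Map<String, String> metadata) {
        return !Boolean.parseBoolean(metadata.get(STOP_GENERATE));
    }
}
```

The thresholds are fixed when the flag is computed, so generate cannot apply a different expression later without rerunning the update step. This is the flexibility trade-off mentioned above.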


> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
>                 Key: NUTCH-2481
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2481
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator, hostdb
>            Reporter: Semyon Semyonov
>            Priority: Minor
>
> To allow the use of previous-step statistics (deltas of fetched, unfetched, etc.) in hostdb.
> The motivation is to use these statistics in generate with maxCount expressions.
>  
> The 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
