nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages
Date Mon, 14 Aug 2017 18:51:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126215#comment-16126215 ]

ASF GitHub Bot commented on NUTCH-1932:
---------------------------------------

lewismc commented on a change in pull request #211: NUTCH-1932 Automatically remove orphaned pages
URL: https://github.com/apache/nutch/pull/211#discussion_r133030580
 
 

 ##########
 File path: src/java/org/apache/nutch/scoring/ScoringFilter.java
 ##########
 @@ -179,6 +179,20 @@ public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
       List<CrawlDatum> inlinked) throws ScoringFilterException;
 
   /**
+   * This method may change the score or status of CrawlDatum during CrawlDb
+   * update, when the URL is neither fetched nor has any inlinks.
+   *
+   * @param url
+   *          URL of the page
+   * @param datum
+   *          CrawlDatum for page
+   * @throws ScoringFilterException
 
 Review comment:
   I think this may break Javadoc generation if no description is provided alongside the
   exception itself.
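
One way to address the comment above is to follow the exception name in the @throws tag with a short description. The sketch below is illustrative only: `SketchScoringException`, `OrphanAwareFilter`, the method name `onOrphanCandidate`, and the `StringBuilder` stand-in for CrawlDatum are hypothetical names, not the Nutch API, and the wording of the @throws description is an assumption.

```java
/** Hypothetical stand-in for ScoringFilterException, so the sketch compiles alone. */
class SketchScoringException extends Exception {
  SketchScoringException(String message) {
    super(message);
  }
}

/** Hypothetical, simplified stand-in for a scoring filter interface. */
interface OrphanAwareFilter {

  /**
   * This method may change the score or status of a datum during CrawlDb
   * update, when the URL is neither fetched nor has any inlinks.
   *
   * @param url
   *          URL of the page
   * @param datum
   *          placeholder for the page's CrawlDatum
   * @throws SketchScoringException
   *           if the filter cannot process the datum (assumed wording)
   */
  void onOrphanCandidate(String url, StringBuilder datum)
      throws SketchScoringException;
}
```

The only change relative to the Javadoc in the diff is the indented description line under @throws; the rest mirrors the existing @param layout.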
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
>                      NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
>                      NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
>                      NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, i.e. no other
> pages link to it anymore. If a page hasn't been linked to after markGoneAfter seconds,
> the page is marked as gone and is then removed by an indexer. If a page hasn't been linked
> to after markOrphanAfter seconds, the page is removed from the CrawlDB.
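
The two thresholds in the description can be read as a simple decision on how long a page has gone without inlinks. The following is a minimal sketch of that reading, not the code from the attached patches: the class `OrphanDecision`, the `Action` enum, and the `decide` method are hypothetical, timestamps are plain seconds, and it assumes markOrphanAfter is at least as large as markGoneAfter.

```java
/** Hypothetical sketch of the orphan decision; not the actual Nutch filter. */
final class OrphanDecision {

  /** Possible outcomes for a page, following the issue description. */
  enum Action { KEEP, MARK_GONE, REMOVE_FROM_CRAWLDB }

  private final long markGoneAfter;   // seconds without inlinks before marking gone
  private final long markOrphanAfter; // seconds without inlinks before removal

  OrphanDecision(long markGoneAfter, long markOrphanAfter) {
    this.markGoneAfter = markGoneAfter;
    this.markOrphanAfter = markOrphanAfter;
  }

  /**
   * Decides what to do with a page.
   *
   * @param lastInlinkSeenSec time (epoch seconds) an inlink was last observed
   * @param nowSec current time (epoch seconds)
   * @return the action implied by the two thresholds
   */
  Action decide(long lastInlinkSeenSec, long nowSec) {
    long unlinkedFor = nowSec - lastInlinkSeenSec;
    // Check the larger threshold first so removal takes precedence.
    if (unlinkedFor > markOrphanAfter) {
      return Action.REMOVE_FROM_CRAWLDB;
    }
    if (unlinkedFor > markGoneAfter) {
      return Action.MARK_GONE;
    }
    return Action.KEEP;
  }
}
```

For example, with `new OrphanDecision(100, 200)`, a page last linked 150 seconds ago would be marked as gone, and one last linked 250 seconds ago would be removed from the CrawlDB.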



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
