nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "NutchScoring" by LewisJohnMcgibbney
Date Sun, 21 Sep 2014 18:00:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchScoring" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchScoring?action=diff&rev1=8&rev2=9

  Scoring occurs in numerous places throughout the Nutch codebase and consequently within
the crawl cycle. This section describes the point of occurrence and functional purpose at each
step.
   
   * [[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java|./src/java/org/apache/nutch/crawl/Injector.java]]
- Scoring filters are defined within the various MapReduce job configurations. This means
that the desired configuration will be used appropriately at runtime when the job is run by
the JobClient. The Injector actually contains two MapReduce jobs, namely
-     * sortJob - where we set the InjectMapper as the Mapreduce Mapper override. The InjectMapper
uses ScoringFilters to calculate a new initial score for a particular URL based on passing
in the Hadoop Text key (representing the URL of the page) and associated CrawlDatum value
(representing a new datum. Filters will modify it in-place) to the ScoringFilters.injectedScore
method. Essentially this sets an initial score for newly injected pages. It should be noted
that newly injected pages may have no inlinks, so filter implementations may wish to set this
score to a non-zero value, to give newly injected pages some initial credit.
-     * mergeJob - 
+     * sortJob - where we set the InjectMapper as the MapReduce Mapper override. The InjectMapper
uses ScoringFilters to calculate an initial score for a particular URL by passing
the Hadoop Text key (representing the URL of the page) and the associated CrawlDatum value
(representing a new datum, which filters modify in place) to the ScoringFilters.injectedScore
method. Essentially this sets an initial score for newly injected pages. Note
that newly injected pages may have no inlinks, so filter implementations may wish to set this
score to a non-zero value, to give newly injected pages some initial credit. We are concerned
with the value of {{{db.score.injected}}} in this case, as it assigns a default score of 1.0f
to new pages added by the injector. This default can, however, be overridden
by attaching the {{{nutch.score}}} metadata key to any URL in a seed list, which sets
a custom score for that specific URL. If the key is present we assign its score to the
CrawlDatum object; if not, we use the default score described above.
+     * mergeJob - which combines multiple new entries for a given URL. This is necessary,
for example, when the same URL appears more than once within the same seed list. In
this job we are concerned with the value of the {{{db.score.injected}}} configuration
property set in {{{nutch-site.xml}}}. This value represents the score of new pages
added by the injector. It is relevant here because we must know whether a record already
exists, in which case we wish to update it but not overwrite its value.
   * ./src/java/org/apache/nutch/crawl/CrawlDbReducer.java
   * ./src/java/org/apache/nutch/crawl/Generator.java
   * ./src/java/org/apache/nutch/fetcher/Fetcher.java
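
The Injector behaviour described above (a default injected score, an optional per-URL {{{nutch.score}}} override, and a merge that does not overwrite existing records) can be sketched in plain Java. This is only an illustration and not the Injector's actual code: the class {{{InjectScoreSketch}}} and its methods are hypothetical, the seed line is assumed to be the URL followed by tab-separated key=value metadata, a plain map stands in for the CrawlDb, and keeping the first of two duplicate seed entries is an arbitrary choice made for the sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not Nutch's Injector code. */
class InjectScoreSketch {

    /** Stand-in for the db.score.injected default of 1.0f described above. */
    static final float DEFAULT_INJECTED_SCORE = 1.0f;

    /** Assumption: the URL is the first tab-separated field of a seed line. */
    static String urlOf(String seedLine) {
        return seedLine.trim().split("\t")[0];
    }

    /**
     * Returns the nutch.score metadata value if the seed line carries a
     * parseable one, otherwise the supplied default score.
     */
    static float injectedScore(String seedLine, float defaultScore) {
        String[] fields = seedLine.trim().split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq < 0) {
                continue; // not a key=value pair, ignore it
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            if (key.equals("nutch.score")) {
                try {
                    return Float.parseFloat(value);
                } catch (NumberFormatException e) {
                    return defaultScore; // unparseable override: fall back
                }
            }
        }
        return defaultScore;
    }

    /**
     * Merge step of the sketch: URLs already present keep their score, and
     * duplicate seed entries collapse to the first one seen (an assumption).
     */
    static Map<String, Float> merge(Map<String, Float> existing,
                                    Iterable<String> seedLines) {
        Map<String, Float> merged = new LinkedHashMap<>(existing);
        for (String line : seedLines) {
            merged.putIfAbsent(urlOf(line),
                    injectedScore(line, DEFAULT_INJECTED_SCORE));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Float> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.org/a", 0.3f); // record that already exists
        Map<String, Float> merged = merge(crawlDb, Arrays.asList(
                "http://example.org/a\tnutch.score=5.0", // existing: not overwritten
                "http://example.org/b",                  // new: default score
                "http://example.org/c\tnutch.score=2.5"  // new: custom score
        ));
        merged.forEach((url, score) -> System.out.println(url + " " + score));
    }
}
```

In a real crawl the default would come from {{{db.score.injected}}} in {{{nutch-site.xml}}} rather than a constant, and any configured scoring filters could adjust the score further through ScoringFilters.injectedScore.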
