nutch-dev mailing list archives

From Ali Safdar Kureishy
Subject Questions about the "hostCount" and related variables in org.apache.nutch.crawl.Generator$Selector::reduce()
Date Mon, 04 Jun 2012 11:52:44 GMT

I'm trying to understand how the Nutch Generator class works, but I'm
finding the following lines of code in
org.apache.nutch.crawl.Generator$Selector::reduce() very hard to follow,
because the comments aren't very clear and the code is intricate:


[~Line 264 onwards]
        // only filter if we are counting hosts or domains
        if (maxCount > 0) {
          int[] hostCount = hostCounts.get(hostordomain);
          if (hostCount == null) {
            hostCount = new int[] {1, 0};
            hostCounts.put(hostordomain, hostCount);
          }

          // increment hostCount
          hostCount[1]++;

          // check if topN reached, select next segment if it is
          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
          }

          ...

1) What is the purpose of the 2-element array 'hostCount' (initialized to
{1, 0})? And what does each of the two slots, [0] and [1], represent?

2) What is this loop (from the excerpt above) doing?

          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
          }

I don't understand the relation between segCounts and hostCount[0], nor
the relation between hostCount[0] and maxNumSegments. All of these
variables appear orthogonal to me, unless I'm misunderstanding how they
are used relative to each other. If someone could elaborate on this, that
would be greatly appreciated.

3) Lastly, I am not clear about what effect "maxNumSegments" has on the
generate phase. Perhaps if I understood the code above I wouldn't ask this
question either...
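
To make questions 1 and 2 concrete, here is a minimal standalone sketch of
one possible reading of the excerpt. It is not Nutch code. It assumes that
hostCount[0] is a 1-based segment number, that hostCount[1] counts the URLs
of that host placed in the current segment, that "limit" is a per-segment
cap, and that the "increment hostCount" comment refers to hostCount[1]++.
The class name HostCountSketch, the assign() method, the sample host names,
and the final segCounts increment are hypothetical additions for
illustration. The maxCount check and everything after the "..." in the
excerpt are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the excerpt quoted above; not the real Selector.
class HostCountSketch {
    private final int maxNumSegments;  // number of segments that may be filled
    private final long limit;          // assumed: maximum URLs per segment
    private final int[] segCounts;     // segCounts[i] = URLs placed in segment i + 1
    private final Map<String, int[]> hostCounts = new HashMap<>();

    HostCountSketch(int maxNumSegments, long limit) {
        this.maxNumSegments = maxNumSegments;
        this.limit = limit;
        this.segCounts = new int[maxNumSegments];
    }

    // Returns the 1-based segment number chosen for one URL of the given host.
    int assign(String hostordomain) {
        int[] hostCount = hostCounts.get(hostordomain);
        if (hostCount == null) {
            // Assumed reading: {segment number, URLs of this host in that segment}
            hostCount = new int[] {1, 0};
            hostCounts.put(hostordomain, hostCount);
        }

        // Assumed meaning of the "increment hostCount" comment.
        hostCount[1]++;

        // Same loop as in the excerpt: while the host's current segment is
        // full and another segment exists, move the host to the next segment.
        while (segCounts[hostCount[0] - 1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
        }

        // Assumption, not in the excerpt: record the URL so segments fill up.
        segCounts[hostCount[0] - 1]++;
        return hostCount[0];
    }

    public static void main(String[] args) {
        // Two segments, at most two URLs per segment.
        HostCountSketch sketch = new HostCountSketch(2, 2);
        String[] hosts = {"a.example", "a.example", "b.example", "b.example"};
        for (String host : hosts) {
            System.out.println(host + " -> segment " + sketch.assign(host));
        }
    }
}
```

Under this reading, hostCount[0] - 1 is simply the index into segCounts for
the segment the host is currently assigned to, and maxNumSegments bounds
how far the loop can advance. Whether that matches the actual intent of
the Generator is exactly what the questions above ask.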

Many thanks in advance!

