nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb
Date Thu, 06 Sep 2007 17:38:31 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525475 ]

Andrzej Bialecki  commented on NUTCH-530:
-----------------------------------------

I'm still against this patch, precisely because we are not sure how many times the ScoringFilters
will be executed - it may be once, twice, or N times. The current contract for ScoringFilters
is that they are executed exactly once.
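
For illustration only (this is neither Nutch nor Hadoop code, and every name in it is hypothetical):
a minimal, self-contained sketch of why anything plugged into a combiner has to be safe to execute
an arbitrary number of times. A step whose logic assumes it runs exactly once gives different results
depending on how the framework happens to split the work.

import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a "combine" step that is not safe to repeat. */
public class CombinerContractSketch {

    // Pretend this is a scoring step that adds a one-time bonus of 1.0
    // whenever a key's values are aggregated.
    static double combine(List<Double> partialScores) {
        double sum = 0.0;
        for (double s : partialScores) {
            sum += s;
        }
        return sum + 1.0; // "one-time" logic baked into a step that may repeat
    }

    public static void main(String[] args) {
        List<Double> inlinkScores = Arrays.asList(0.2, 0.3, 0.5);

        // Case 1: the framework skips the combiner; the step runs once, in the reducer.
        double runOnce = combine(inlinkScores);

        // Case 2: the combiner runs on two map-side spills, then the reducer
        // aggregates the partial results -- the step runs three times.
        double spillA = combine(inlinkScores.subList(0, 2));
        double spillB = combine(inlinkScores.subList(2, 3));
        double runThreeTimes = combine(Arrays.asList(spillA, spillB));

        System.out.println("executed once:        " + runOnce);        // 2.0
        System.out.println("executed three times: " + runThreeTimes);  // 4.0
    }
}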

CrawlDbReducer itself does not reduce all inlinked datums to a single CrawlDatum - it's up
to the scoring filters to do whatever they want with all of the inlinks. Although it's true
that scoring-opic performs an operation equivalent to this, that may not always be the case
for other filters.

Second, let's consider the following scenario (BTW, this is close to one of the ScoringFilters
that I actually implemented, so it's not far-fetched): let's say I implemented a ScoringFilter
that checks for the existence of a flag in the CrawlDatum (presumably put there by the Generator),
and based on the value of this flag it computes the score from the inlinks differently. Then it
clears the flag to mark a successful update. If we ran an updatedb that included your patch, this
operation would work correctly in the first spill from the Combiner (although with vastly incomplete
information), and then it would fail to do the right thing on subsequent runs through the Combiner
or Reducer, because the flag would already have been cleared.
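
Again purely as an illustration (this is not the real ScoringFilter API; FakeDatum, updateScore
and GENERATOR_FLAG are made up for the sketch): a model of the flag scenario above. The step
weights inlinks differently when the flag is present and then clears it; once a combiner spill
has consumed the flag, the reducer pass no longer sees it and scores the remaining inlinks with
the wrong factor.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CrawlDatum: just a score and a metadata map. */
class FakeDatum {
    double score;
    Map<String, Boolean> meta = new HashMap<>();
}

public class FlagScoringSketch {

    // Scoring step: if the generator flag is set, weight inlinks twice as much,
    // then clear the flag to mark a successful update.
    static void updateScore(FakeDatum datum, List<Double> inlinkScores) {
        boolean flagged = datum.meta.getOrDefault("GENERATOR_FLAG", false);
        double factor = flagged ? 2.0 : 1.0;
        for (double s : inlinkScores) {
            datum.score += factor * s;
        }
        datum.meta.remove("GENERATOR_FLAG"); // one-shot: cleared after the "first" update
    }

    public static void main(String[] args) {
        // Reducer-only run: the step sees the flag and all inlinks at once.
        FakeDatum once = new FakeDatum();
        once.meta.put("GENERATOR_FLAG", true);
        updateScore(once, Arrays.asList(0.2, 0.3, 0.5));
        System.out.println("reducer only:       " + once.score); // 2.0

        // Combiner spill first (partial inlinks), then reducer: the flag is
        // already gone on the second pass, so the remaining inlinks are
        // weighted with the wrong factor.
        FakeDatum combined = new FakeDatum();
        combined.meta.put("GENERATOR_FLAG", true);
        updateScore(combined, Arrays.asList(0.2, 0.3)); // combiner spill
        updateScore(combined, Arrays.asList(0.5));      // reducer pass
        System.out.println("combiner + reducer: " + combined.score); // 1.5
    }
}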

> Add a combiner to improve performance on updatedb
> -------------------------------------------------
>
>                 Key: NUTCH-530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-530
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: java 1.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output of the map task when we try to update the crawldb based on the segment fetched.
> We can use a combiner to improve the performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

