nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
Date Tue, 09 May 2006 21:20:05 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378755 ] 

Andrzej Bialecki  commented on NUTCH-267:
-----------------------------------------

I would argue that what Nutch implements now shouldn't be called OPIC, because it has little
to do with the algorithm described in the OPIC paper. Either we fix it, or we should rename
it. Let me explain:

* the paper uses a "cash flow" concept, where nodes not only receive score contributions
but also give them away, thus _reducing_ their available score. This is not implemented in
Nutch, which leads to scores growing without bound. It also makes the score dependent on
the number of fetch cycles, i.e. the scores of two pages with exactly the same inlinks will
differ if one of them underwent more refresh cycles than the other. So the fundamental
premise of the algorithm - that scores converge to certain values as a result of cash
flow balance - does not hold.

* the paper uses a concept of "virtual nodes" that give away cash to disconnected nodes in
the current graph. In reality, these nodes are probably connected, but the current graph is
not complete enough to track it. The Nutch implementation doesn't use this, but only because
it doesn't give away "cash".

* finally, the paper argues that the OPIC score and other scores should be combined
as a sum of logarithms, i.e. "log(opic) + log(docSimilarity)". Nutch uses the formula "sqrt(opic)
* docSimilarity" (through document boosting).
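
To make the first point concrete, here is a minimal Java sketch on a toy graph. It is not
Nutch's scoring code and not the full OPIC algorithm from the paper; the class CashFlowSketch,
the three-page graph and the synchronous update loop are all assumptions made for illustration.
It contrasts an accumulate-only update, where a page hands out score but keeps its own, with
a cash-flow update, where a page gives away everything it holds.

```java
import java.util.Arrays;

class CashFlowSketch {
    // Hypothetical 3-page graph: OUTLINKS[i] lists the pages that page i links to.
    static final int[][] OUTLINKS = { {1, 2}, {2}, {0} };

    /** Accumulate-only update: every page hands out score but also keeps its own. */
    static double[] accumulateOnly(int cycles) {
        double[] score = new double[OUTLINKS.length];
        Arrays.fill(score, 1.0);
        for (int c = 0; c < cycles; c++) {
            double[] next = score.clone(); // the giver's score is NOT reduced
            for (int i = 0; i < OUTLINKS.length; i++) {
                for (int j : OUTLINKS[i]) {
                    next[j] += score[i] / OUTLINKS[i].length;
                }
            }
            score = next;
        }
        return score;
    }

    /** Cash-flow update: a page gives away all of its cash, so the total is conserved. */
    static double[] cashFlow(int cycles) {
        double[] cash = new double[OUTLINKS.length];
        Arrays.fill(cash, 1.0 / OUTLINKS.length);
        for (int c = 0; c < cycles; c++) {
            double[] next = new double[OUTLINKS.length]; // givers start again from zero
            for (int i = 0; i < OUTLINKS.length; i++) {
                for (int j : OUTLINKS[i]) {
                    next[j] += cash[i] / OUTLINKS[i].length;
                }
            }
            cash = next;
        }
        return cash;
    }

    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("accumulate-only, 5 cycles:  " + Arrays.toString(accumulateOnly(5)));
        System.out.println("accumulate-only, 10 cycles: " + Arrays.toString(accumulateOnly(10)));
        System.out.println("cash flow, 50 cycles:       " + Arrays.toString(cashFlow(50)));
        System.out.println("cash flow, 100 cycles:      " + Arrays.toString(cashFlow(100)));
    }
}
```

On this toy graph the accumulate-only total doubles every cycle, so a page's score depends
on how many cycles it has been through, while the cash-flow total stays at 1 and the per-page
values settle near 0.4, 0.2 and 0.4.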

I'm going to commit the scoring API soon; this should make it easier to experiment with different
scoring models.

> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
>          Key: NUTCH-267
>          URL: http://issues.apache.org/jira/browse/NUTCH-267
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Chris Schneider
>     Priority: Minor
>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count
> was true, the indexer boost value was scaled based on the log of the number of inbound links:
>     if (boostByLinkCount)
>       res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). Instead, the
> boost value is just the square root (or some other scorePower) of the page score. Shouldn't
> the invertlinks command, which creates the linkdb, have some effect on the boost value calculated
> during indexing (either via the OPICScoringFilter or some other built-in filter)?
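
The two boost rules described in the issue can be put side by side. This is a minimal sketch,
not the indexer's code: the class BoostSketch and its method and parameter names are hypothetical;
only the log(e + linkCount) expression and the scorePower idea come from the issue text above.

```java
class BoostSketch {
    /** Older rule quoted in the issue: scale the boost by log(e + number of inbound links). */
    static float linkCountBoost(float baseBoost, int linkCount) {
        return baseBoost * (float) Math.log(Math.E + linkCount);
    }

    /** Rule the issue describes for current code: page score raised to scorePower (0.5 = square root). */
    static float scorePowerBoost(float pageScore, float scorePower) {
        return (float) Math.pow(pageScore, scorePower);
    }

    public static void main(String[] args) {
        // Illustrative inputs only; they are not taken from any real crawl.
        System.out.println("link-count boost, 0 inlinks:   " + linkCountBoost(1.0f, 0));
        System.out.println("link-count boost, 100 inlinks: " + linkCountBoost(1.0f, 100));
        System.out.println("score-power boost, score 4.0:  " + scorePowerBoost(4.0f, 0.5f));
    }
}
```

With zero inbound links the older rule multiplies by log(e) = 1 and leaves the boost unchanged;
the score-power rule never looks at the link count, which is the gap the issue points out.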

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

