nutch-dev mailing list archives

From "Doğacan Güney" <doga...@gmail.com>
Subject Re: OPIC scoring differences
Date Wed, 11 Jul 2007 14:41:58 GMT
On 7/9/07, Andrzej Bialecki <ab@getopt.org> wrote:
> Carl Cerecke wrote:
> > Hi,
> >
> > The docs for the OPICScoringFilter mention that the plugin implements a
> > variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
> > How does the difference affect the scores?
>
> As it is now, the implementation doesn't preserve the total "cash value"
> in the system, and also there is almost no smoothing between the
> iterations (Abiteboul's "history").
>
> As a consequence, scores may (and do) vary dramatically between
> iterations, and they don't converge to stable values, i.e. they always
> increase. For pages that get a lot of score contributions from other
> pages, this leads to an explosive increase into the thousands or
> eventually millions. This means that the scores produced by the OPIC
> plugin exaggerate score differences between pages more and more, even if
> the web graph that you crawl is stable.
>
> In a sense, to follow the "cash" analogy, our implementation of OPIC
> illustrates a runaway economy - galloping inflation, rich get richer and
> poor get poorer ;)
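
To make the inflation concrete, here is a toy model of that runaway
economy (plain Java over a made-up two-page cycle A <-> B, not Nutch
code): on each "iteration" a fetched page keeps its cash while a full
copy of it is also credited to its single outlink, i.e. no adjustment.

  // Toy model only: pages A and B link to each other, 1.0 cash each.
  public class OpicInflation {
      public static void main(String[] args) {
          double a = 1.0, b = 1.0;
          for (int iter = 1; iter <= 10; iter++) {
              double newA = a + b; // A keeps a, and also receives all of b
              double newB = b + a; // B keeps b, and also receives all of a
              a = newA;
              b = newB;
              System.out.printf("iter %2d: A=%6.0f B=%6.0f total=%7.0f%n",
                      iter, a, b, a + b);
          }
      }
  }

The total starts at 2 and doubles on every pass, reaching 2048 after
ten iterations, even though the toy graph is completely stable.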
>
>
> > Also, there's a comment in the code:
> >
> > // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
> > // XXX in the paper, where page "loses" its score if it's distributed to
> > // XXX linked pages...
> >
> > Is this something that will be looked at eventually, or is the scoring
> > "good enough" at the moment without some "adjustment"?
>
> Yes, I'll start working on it when I get back from vacation. I did some
> simulations that show how to fix it (see the bottom of
> http://wiki.apache.org/nutch/FixingOpicScoring).

Andrzej, nice to see you working on this.
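
For reference, here is how I read the adjustment in the paper (a rough
sketch with made-up Page/outlink types, not Nutch's actual API): when a
page's cash is distributed, the cash moves into the page's history and
the page's own cash drops to zero, so the total amount of cash in the
system stays constant.

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical toy types, not Nutch classes.
  class Page {
      double cash = 1.0;    // cash waiting to be distributed
      double history = 0.0; // cash already distributed (the paper's "history")
      List<Page> outlinks = new ArrayList<>();
  }

  class OpicAdjusted {
      // Cash-conserving update: the fetched page gives away all of its
      // cash and keeps none, so scores estimated from history can
      // converge instead of exploding.
      static void distribute(Page page) {
          if (page.outlinks.isEmpty()) {
              return; // the paper routes such cash to a virtual root page
          }
          double share = page.cash / page.outlinks.size();
          for (Page target : page.outlinks) {
              target.cash += share;  // equal share to every outlink
          }
          page.history += page.cash; // remember what was distributed
          page.cash = 0.0;           // the "adjustment" the XXX comment asks about
      }
  }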

There is one thing that I don't understand about your presentation.
Assume that page A is the only URL in our crawldb and that it has n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0-Pn are fetched and their cash is distributed to their outlinks.
         - At this point, it is possible that some page Pk links back to
page A, so now page A's cash > 0 again.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than the current time).
         - Won't page A become a temporary sink? The time between
subsequent fetches may be as large as 30 days in the default
configuration, so page A will accumulate cash for a long time without
distributing it.
         - I don't see how we can achieve this, but, IMO, if a page is
considered but not generated, Nutch should distribute its cash to the
outlinks that are stored in its parse data (rough sketch below). I know
that this is incredibly hard, if not impossible, to do.
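
To sketch what I mean (reusing the made-up types from above, not real
Nutch code), the generate-time flush could look roughly like this:

  class GeneratorSketch {
      // A page that is considered but skipped because its next fetch
      // time hasn't arrived could flush its accumulated cash to the
      // outlinks remembered from its last parse, instead of sitting on
      // it for up to 30 days. nextFetchTime would come from the page's
      // crawldb entry.
      static void consider(Page page, long nextFetchTime, long now) {
          if (nextFetchTime > now && page.cash > 0.0) {
              OpicAdjusted.distribute(page); // outlinks from parse data
          }
      }
  }

The catch, of course, is that generation works on the crawldb, while
the outlinks only live in the segments' parse data, which is why I
suspect this is so hard to do.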

Or am I missing something here?

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney