nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doğacan Güney" <>
Subject Re: NUTCH-92
Date Thu, 27 Nov 2008 16:42:32 GMT

On Wed, Nov 26, 2008 at 3:04 AM, Andrzej Bialecki <> wrote:
> Hi all,
> After reading this paper:
> I came up with the following idea of implementing global IDF in Nutch. The
> upside of the approach I propose is that it brings back the cost of making a
> search query to 1 RPC call. The downside is that the search servers need to
> cache global IDF estimates as computed by the DS.Client, which ties them to
> a single query front-end (DistributedSearch.Client), or requires keeping a
> map of <client, globalIDFs> on each search server.
> ---------
> First, as the paper above claims, we don't really need exact IDF values of
> all terms from every index. We should get acceptable quality if we only
> learn the top-N frequent terms, and for the rest of them we apply a
> smoothing function that is based on global characteristics of each index
> (such as the number of terms in the index).
> This means that the data that needs to be collected by the query integrator
> (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist
> of a list of e.g. top 500 local terms with their frequency, plus the local
> smoothing factor as a single value.
> We could further reduce the amount of data to be sent from/to shard servers
> by encoding this information in a counted Bloom filter with a single-byte
> resolution (or a spectral Bloom filter, whichever yields a better precision
> / bit in our case).
> The query integrator would ask all active shard servers to provide their
> local IDF data, and it would compute global IDFs for these terms, plus a
> global smoothing factor, and send back the updated information to each shard
> server. This would happen once per lifetime of a local shard, and is needed
> because of the local query rewriting (and expansion of terms from Nutch
> Query to Lucene Query).
> Shard servers would then process incoming queries using the IDF estimates
> for terms included in the global IDF data, or the global smoothing factors
> for terms missing from that data (or use local IDFs).
> The global IDF data would have to be recomputed each time the set of shards
> available to a DS.Client changes, and then it needs to be broadcast back
> from the client to all servers - which is the downside of this solution,
> because servers need to keep a cache of this information for every DS.Client
> (each of them possibly having a different list of shard servers, hence
> different IDFs). Also, as shard servers come and go, the IDF data keeps
> being recomputed and broadcast, which increases the traffic between the
> client and servers.
> Still I believe the amount of additional traffic should be minimal in a
> typical scenario, where changes to the shards are much less frequent than
> the frequency of sending user queries. :)
> ------
> Now, if this approach seems viable (please comment on this), what should we
> do with the patches in NUTCH-92 ?
> 1. skip them for now, and wait until the above approach is implemented, and
> pay the penalty of using skewed local IDFs.
> 2. apply them now, and pay the penalty of additional RPC call / search, and
> replace this mechanism with the one described above, whenever that becomes
> available.

It seems I wrote the patch in NUTCH-92. My recollection was that you
wrote it, Andrzej :D
Anyway, I have no idea what I did in that patch, don't know if it
works or applies etc. Really,
I am just curios. Did anyone test it? Does it really work :) ?

I haven't read the paper yet but the proposed approach sounds better
to me. Do you have any
code ready, Andrzej? Or how difficult is it to implement it?

> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com

Doğacan Güney
View raw message