nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject NUTCH-92
Date Wed, 26 Nov 2008 01:04:22 GMT
Hi all,

After reading this paper:

http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea of implementing global IDF in Nutch. 
The upside of the approach I propose is that it brings back the cost of 
making a search query to 1 RPC call. The downside is that the search 
servers need to cache global IDF estimates as computed by the DS.Client, 
which ties them to a single query front-end (DistributedSearch.Client), 
or requires keeping a map of <client, globalIDFs> on each search server.

---------

First, as the paper above claims, we don't really need exact IDF values 
of all terms from every index. We should get acceptable quality if we 
only learn the top-N frequent terms, and for the rest of them we apply a 
smoothing function that is based on global characteristics of each index 
(such as the number of terms in the index).

This means that the data that needs to be collected by the query 
integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) 
would consist of a list of e.g. top 500 local terms with their 
frequency, plus the local smoothing factor as a single value.

We could further reduce the amount of data to be sent from/to shard 
servers by encoding this information in a counted Bloom filter with a 
single-byte resolution (or a spectral Bloom filter, whichever yields a 
better precision / bit in our case).

The query integrator would ask all active shard servers to provide their 
local IDF data, and it would compute global IDFs for these terms, plus a 
global smoothing factor, and send back the updated information to each 
shard server. This would happen once per lifetime of a local shard, and 
is needed because of the local query rewriting (and expansion of terms 
from Nutch Query to Lucene Query).

Shard servers would then process incoming queries using the IDF 
estimates for terms included in the global IDF data, or the global 
smoothing factors for terms missing from that data (or use local IDFs).

The global IDF data would have to be recomputed each time the set of 
shards available to a DS.Client changes, and then it needs to be 
broadcast back from the client to all servers - which is the downside of 
this solution, because servers need to keep a cache of this information 
for every DS.Client (each of them possibly having a different list of 
shard servers, hence different IDFs). Also, as shard servers come and 
go, the IDF data keeps being recomputed and broadcast, which increases 
the traffic between the client and servers.

Still I believe the amount of additional traffic should be minimal in a 
typical scenario, where changes to the shards are much less frequent 
than the frequency of sending user queries. :)

------

Now, if this approach seems viable (please comment on this), what should 
we do with the patches in NUTCH-92 ?

1. skip them for now, and wait until the above approach is implemented, 
and pay the penalty of using skewed local IDFs.

2. apply them now, and pay the penalty of additional RPC call / search, 
and replace this mechanism with the one described above, whenever that 
becomes available.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Mime
View raw message