nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Dean <seand...@rogers.com>
Subject Re: NUTCH-92
Date Wed, 26 Nov 2008 06:04:21 GMT
This method of calculating global IDF values certainly sounds more efficient then the currently
proposed method. The reduction of 1 RPC call during the search query (so that only 1 RPC call
is made in total) should reduce the overall load on each search server. I prefer the idea
of having network broadcasts going out during the initial startup and only thereafter during
a topology changing event.

To me this kind of sounds like network routing tables, the initial table is setup during startup
and checked periodically for changes. When a change is detected the table is modified (sometimes
regenerated completely) and the network continues to operate. The alternative (based on the
current patch) is to check the table every time a packet (or maybe connection) is sent to
one of the devices listed inside. This method may be faster to detect any problem but the
additional load would be substantial.

With all this said though, the amount of time needed to research and develop this new method
may take an extended period of time depending on developer availability. We have a proposed
solution (albeit not as nice) that did work on older code that may only need a quick refresh
to work with trunk (and the future 1.0 release). I would personally like to see NUTCH-92 (or
some form of it) included in trunk for a legitimate evaluation before the next release.


Sean Dean




________________________________
From: Andrzej Bialecki <ab@getopt.org>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, November 25, 2008 8:04:22 PM
Subject: NUTCH-92

Hi all,

After reading this paper:

http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea of implementing global IDF in Nutch. 
The upside of the approach I propose is that it brings back the cost of 
making a search query to 1 RPC call. The downside is that the search 
servers need to cache global IDF estimates as computed by the DS.Client, 
which ties them to a single query front-end (DistributedSearch.Client), 
or requires keeping a map of <client, globalIDFs> on each search server.

---------

First, as the paper above claims, we don't really need exact IDF values 
of all terms from every index. We should get acceptable quality if we 
only learn the top-N frequent terms, and for the rest of them we apply a 
smoothing function that is based on global characteristics of each index 
(such as the number of terms in the index).

This means that the data that needs to be collected by the query 
integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) 
would consist of a list of e.g. top 500 local terms with their 
frequency, plus the local smoothing factor as a single value.

We could further reduce the amount of data to be sent from/to shard 
servers by encoding this information in a counted Bloom filter with a 
single-byte resolution (or a spectral Bloom filter, whichever yields a 
better precision / bit in our case).

The query integrator would ask all active shard servers to provide their 
local IDF data, and it would compute global IDFs for these terms, plus a 
global smoothing factor, and send back the updated information to each 
shard server. This would happen once per lifetime of a local shard, and 
is needed because of the local query rewriting (and expansion of terms 
from Nutch Query to Lucene Query).

Shard servers would then process incoming queries using the IDF 
estimates for terms included in the global IDF data, or the global 
smoothing factors for terms missing from that data (or use local IDFs).

The global IDF data would have to be recomputed each time the set of 
shards available to a DS.Client changes, and then it needs to be 
broadcast back from the client to all servers - which is the downside of 
this solution, because servers need to keep a cache of this information 
for every DS.Client (each of them possibly having a different list of 
shard servers, hence different IDFs). Also, as shard servers come and 
go, the IDF data keeps being recomputed and broadcast, which increases 
the traffic between the client and servers.

Still I believe the amount of additional traffic should be minimal in a 
typical scenario, where changes to the shards are much less frequent 
than the frequency of sending user queries. :)

------

Now, if this approach seems viable (please comment on this), what should 
we do with the patches in NUTCH-92 ?

1. skip them for now, and wait until the above approach is implemented, 
and pay the penalty of using skewed local IDFs.

2. apply them now, and pay the penalty of additional RPC call / search, 
and replace this mechanism with the one described above, whenever that 
becomes available.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Mime
View raw message