I'm actually working on exactly the same problem. Just yesterday, I
implemented a new query (called CooccuranceQuery) that, given a list of
terms, acts as a BooleanQuery with all of the terms being required and
then reports back a list of other terms in the index with a count of how
many documents contain each one in the result set (it actually returns a
TermEnum-like object). There are, of course, a couple of problems with
this. First, as you mentioned, this is not a reasonable solution for an
index with a large number of unique terms. The number of documents
doesn't have as much effect because scanning through documents without
retrieving them is fast. However, each term in the index (as reported by
reader.terms) needs its own TermEnum (or TermPositions) object, and this
can quickly get out of hand. I tried it on an 8000-term index and the
performance seemed pretty good, but once you get up to 30,000...
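To make the approach concrete, here is a toy sketch of the counting logic with plain Java collections standing in for Lucene's postings. The postings map and all class/method names are made up for illustration; the real query walks TermEnum/TermPositions instead:

```java
import java.util.*;

// Hypothetical sketch of the co-occurrence counting idea. "postings"
// maps each term to the set of document numbers containing it.
public class CooccurrenceSketch {
    // Documents matching ALL required terms (the BooleanQuery part).
    static Set<Integer> matchAll(Map<String, Set<Integer>> postings,
                                 List<String> required) {
        Set<Integer> result = null;
        for (String term : required) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    // For every other term, count how many documents of the result set
    // contain it. Note this walks one postings list per unique term in
    // the index, which is exactly the part that fails to scale.
    static Map<String, Integer> cooccurrences(Map<String, Set<Integer>> postings,
                                              List<String> required) {
        Set<Integer> hits = matchAll(postings, required);
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (required.contains(e.getKey())) continue;
            int n = 0;
            for (int doc : e.getValue()) if (hits.contains(doc)) n++;
            if (n > 0) counts.put(e.getKey(), n);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("lucene", new HashSet<>(Arrays.asList(0, 1, 2)));
        postings.put("index",  new HashSet<>(Arrays.asList(0, 2)));
        postings.put("query",  new HashSet<>(Arrays.asList(0, 3)));
        // prints {query=1}
        System.out.println(cooccurrences(postings, Arrays.asList("lucene", "index")));
    }
}
```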
Another problem with this approach is that MultiSearcher does not
provide easy access to the terms in the combined index. This I could
solve pretty easily, but it seems that this approach won't scale anyway,
so I'm not doing this yet.
Finally, a bigger problem is that even if we were to add some kind of
Reader.terms(doc, term) method that would list the terms of a particular
document starting with specified term, we would still get *stemmed*
forms of these terms. In an application that wants to display these to
the user in some way, this will not be acceptable because the stems are
not always complete words (even in English; I don't know what they will
be in other languages). This, of course, has to do with Lucene's
architecture where Analyzer is separated from indexing so that the index
never sees the original word forms.
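To illustrate the display problem: the index dictionary only ever sees the Analyzer's output. The stem() below is a crude hypothetical suffix stripper, not Porter's algorithm, but real stemmers truncate in the same way:

```java
// Toy illustration of why stemmed index terms are unfit for display.
// stem() is a made-up suffix stripper standing in for a real stemmer.
public class StemDisplaySketch {
    static String stem(String word) {
        // Hypothetical suffix stripping, NOT the Porter algorithm.
        for (String suffix : new String[] {"ities", "ity", "ies", "s"}) {
            if (word.endsWith(suffix))
                return word.substring(0, word.length() - suffix.length());
        }
        return word;
    }

    public static void main(String[] args) {
        // Distinct surface forms collapse to the same non-word fragment;
        // any terms(doc, term) enumeration could only ever return the
        // right-hand column.
        for (String w : new String[] {"universities", "university"})
            System.out.println(w + " -> " + stem(w)); // both print "univers"
    }
}
```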
The only way to solve this that I see right now is to store a dictionary
of "stem, [form1, form2, ...]" for each term in the index externally.
Also store a mapping "doc, [stem1, stem2, ...]" that would be the
document's term vector. For the term dictionary, there simply isn't any
place in Lucene that could store it. For the document's term vector,
this can be stored in Lucene if we create a new data structure on disk
for it. Finally, storing the term vector in the document itself leads
to very slow processing because now documents must be retrieved and this
field re-parsed.
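A minimal sketch of that external bookkeeping, with a hypothetical one-line stemmer standing in for the real Analyzer; both maps would be filled in as tokens stream by at indexing time:

```java
import java.util.*;

// Sketch of the external "stem -> [form1, form2, ...]" dictionary plus
// the per-document "doc -> [stem1, stem2, ...]" term vector described
// above. stem() is a made-up stand-in for the real stemmer.
public class StemDictionarySketch {
    final Map<String, Set<String>> stemToForms = new HashMap<>();
    final Map<Integer, List<String>> docToStems = new HashMap<>();

    static String stem(String word) {
        // Hypothetical toy stemmer: strip a trailing "s".
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Called once per token at indexing time, alongside the normal add.
    void record(int doc, String surfaceForm) {
        String s = stem(surfaceForm);
        stemToForms.computeIfAbsent(s, k -> new TreeSet<>()).add(surfaceForm);
        docToStems.computeIfAbsent(doc, k -> new ArrayList<>()).add(s);
    }

    // At display time, map a stem back to forms a user can read.
    Set<String> displayForms(String stem) {
        return stemToForms.getOrDefault(stem, Collections.emptySet());
    }

    public static void main(String[] args) {
        StemDictionarySketch d = new StemDictionarySketch();
        d.record(0, "terms");
        d.record(0, "term");
        System.out.println(d.docToStems.get(0));      // prints [term, term]
        System.out.println(d.displayForms("term"));   // prints [term, terms]
    }
}
```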
Anyway, is there anyone else working on a related problem? Should we
collaborate?
-dmitry
Nestel, Frank wrote:
>Hi,
>
>I've been reading the API and I couldn't figure out a
>nice and fast way to solve the following problem:
>
>I'd like to enumerate the tokens of a document (or
>document field). Do the internal data structures
>of Lucene allow this kind of traversal, which is (as
>I understand it) orthogonal to the access pattern Lucene
>is optimized for?
>
>More concretely, I have something like 20-50 tokens/words and one
>document, and I'd like to ask the document whether (and how often)
>it contains those particular tokens. The idea was to augment search
>results with (kind of, I know) automatic query-dependent keywords.
>
>The only way I see right now is to create 20-50 TermEnums
>and walk through them until I end up in my document or
>nowhere, which is probably not feasible for a search result
>page with (say) 20 hits in a larger index.
>
>Any (more elegant) approach I missed?
>
>Thank you,
>Frank
>
>--
>Dr. Frank Sven Nestel
>Principal Software Engineer
>
>COI GmbH Erlanger Straße 62, D-91074 Herzogenaurach
>Phone +49 (0) 9132 82 4611
>http://www.coi.de, mailto:Frank.Nestel@coi.de
> COI - Solutions for Documents
>
>