lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <>
Subject Re: Extract terms not by reader, but by documents
Date Wed, 05 Sep 2007 18:50:20 GMT
Rafael, are you looking for IndexReader.getTermFreqVector?


5 sep 2007 kl. 16.48 skrev Rafael Rossini:

> Thank´s for the reply Grant, let me try to explain exactly what I´d  
> like to
> do. Take the 2 docs:
> Doc1: "Microsoft is a nice software company, and Xbox seems to be a  
> nice
> product too."
> Doc2: "Nintendo and Sony have been in the game industry for a long  
> time, but
> now, Microsoft is trying to enter with Xbox"
> Now If I have a query like this, "(Nintendo AND Sony AND Microsoft) OR
> (Xbox)" and perform a query.rewrite(reader).extractTerms(set), my  
> set is
> going to have all the terms (Nintendo, Sony, Microsoft, Xbox)  
> right? But
> when I iterate the docs, I wanted to do something like
> query.rewrite(doc).extracTerms(set),  [I know this method does not  
> exist, is
> just and example of the funcionality].
> So, for Doc1, my set would be populate with (Xbox) only, and for  
> Doc2 the
> set would be populated with (Nintendo, Sony, Microsoft, Xbox).
> Is this possible? Is it clear now what I´m trying to achieve?
> []s
>      Rossini
> On 9/4/07, Grant Ingersoll <> wrote:
>> Not sure if I am understanding what you are trying to do.  I think
>> you are trying to find out which terms occurred in a particular
>> document, correct?
>> I also am not sure about your first example.  My understanding of
>> extractTerms is that it just gives you back the set of all terms that
>> occur in the _query_, not necessarily those that matched in the
>> document, although it has this effect for things like WildcardQuery
>> and others that get expanded using TermEnum since they are expanded
>> based on what is in the index.  I think this is best seen by the
>> implementation of extractTerms() in in which it just
>> adds the term from the query into the set.  Likewise for BooleanQuery
>> which loops over the clauses and extracts the terms from each clause
>> and adds them to the set.  Thus, if you had a boolean query of all
>> term queries, you would get back the set of all the terms.
>> As for the problem it sounds like you are interested in, you could
>> use SpanQuery functionality with some post processing analysis or try
>> using Term Vectors and the new (unreleased) TermVectorMapper (TVM)
>> functionality (or possibly a combination of both).  In this case, you
>> will need to write your own implementation of the TVM that takes in
>> the query so it knows what terms to identify. If you go the latter
>> route, know that it is new functionality and probably doesn't have a
>> whole lot of users yet, so there may still be issues with it.  See
>> the nightly build or nightly javadocs for info on these.
>> The other question that might be helpful, is what custom highlighting
>> are you doing that isn't covered by the contrib/highlighter?  Perhaps
>> you have some suggestions that are generic enough to help improve
>> it?  Just a thought.
>> Hope this helps,
>> Grant
>> On Sep 4, 2007, at 5:01 PM, Rafael Rossini wrote:
>>> Hi all,
>>>     In some custom highlighting, I often write a code like this:
>>>        Set<Term> matchedTerms = new HashSet<Term>();
>>>        query.rewrite(reader).extractTerms(matchedTerms);
>>>     With this code the Term Set gets populated by the matched query
>>> in your
>>> whole index. Is it possible to this with a document instead of the
>>> reader?
>>> Something like
>>> query.rewrite(documentId).extractTerms(matchedTerms) ?
>>> []s
>>>      Rossini
>> --------------------------
>> Grant Ingersoll
>> Lucene Helpful Hints:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message