lucene-dev mailing list archives

From "none none" <>
Subject Re: Iterators for collecting Terms from Queries
Date Thu, 20 Mar 2003 01:46:43 GMT
>Also, I started thinking that perhaps combining parts of two approaches would 
>make lots of sense, improving performance of my solution, and generalizing 
>your solution a bit? (ie. there'd be more support from core Lucene for 
>implementing highlighters)
>I think having a term query collector (and matching iterator) makes sense. 
>This way all Queries could be easily collected, along with some flags that 
>BooleanClause has (optional etc). This is fairly easy to do, and doesn't have 
>too many performance problems. Plus, caller need then not worry about actual 
>Query tree structure, even if new Queries are added, it's Query's 
>responsibility to add that one traversal method implementation.
>I also don't think this adds too much clutter to general code base.
>However, after queries are collected, it would be possible to access collected 
>Terms using method you implemented, ie. having a method to access Terms 
>collected during query execution. Caller also can choose to do additional 
>query type dependent handling if/as necessary at this point (to access slop 
>amongst other things?)
>So essentially one could traverse all Queries easily, and for each one ask for 
>all the actual terms, without having to worry about exact query type, unless 
>it wants to.
>Now, for some extra convenience, it would be easy to add simple iterators over 
>actual terms. Since method for accessing collected Terms would be in base 
>class, there would be no need to have half a dozen or more iterator classes I 
>had to add to encapsulate collection process. But that would be optional 
>thing to have.
>Finally, a method similar to accessing collected actual terms, but for 
>accessing base term(s) would be useful. Since there can be up to 2 base terms 
>(for range query), I'm not sure of method signature, but implementation 
>should be easy to add (perhaps use signature similar to many JDK API methods, 
>where an optional Collection is passed, into which store Term(s); if null is 
>passed, a new Collection like ArrayList is created and returned).
>Does this make sense?

I am not 100% sure, but I think so. Could you give me an example, even in pseudo-code?
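
One possible reading of the quoted proposal, as a minimal sketch: each query type implements a single traversal method, a BooleanQuery-like container forwards its clause flags, and base terms are returned through the "optional Collection" idiom (pass null to get a new ArrayList). All class and method names here (SketchQuery, QueryVisitor, collectBaseTerms, traverse) are hypothetical stand-ins and not Lucene's API; terms are plain Strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical callback: receives each leaf query plus its clause flags.
interface QueryVisitor {
    void visit(SketchQuery query, boolean required, boolean prohibited);
}

// Hypothetical base class; NOT org.apache.lucene.search.Query.
abstract class SketchQuery {
    // Adds this query's base term(s) to 'into'. If 'into' is null a new
    // ArrayList is created. Returns the collection that was filled.
    abstract Collection<String> collectBaseTerms(Collection<String> into);

    // Each query type is responsible for its own traversal.
    abstract void traverse(boolean required, boolean prohibited, QueryVisitor visitor);
}

final class SketchTermQuery extends SketchQuery {
    private final String term;

    SketchTermQuery(String term) { this.term = term; }

    @Override
    Collection<String> collectBaseTerms(Collection<String> into) {
        Collection<String> out = (into != null) ? into : new ArrayList<>();
        out.add(term);
        return out;
    }

    @Override
    void traverse(boolean required, boolean prohibited, QueryVisitor visitor) {
        visitor.visit(this, required, prohibited);
    }
}

// A range query has up to two base terms (lower and upper bound).
final class SketchRangeQuery extends SketchQuery {
    private final String lower;
    private final String upper;

    SketchRangeQuery(String lower, String upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    Collection<String> collectBaseTerms(Collection<String> into) {
        Collection<String> out = (into != null) ? into : new ArrayList<>();
        if (lower != null) out.add(lower);
        if (upper != null) out.add(upper);
        return out;
    }

    @Override
    void traverse(boolean required, boolean prohibited, QueryVisitor visitor) {
        visitor.visit(this, required, prohibited);
    }
}

final class SketchBooleanQuery extends SketchQuery {
    private static final class Clause {
        final SketchQuery query;
        final boolean required;
        final boolean prohibited;

        Clause(SketchQuery query, boolean required, boolean prohibited) {
            this.query = query;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    SketchBooleanQuery add(SketchQuery query, boolean required, boolean prohibited) {
        clauses.add(new Clause(query, required, prohibited));
        return this;
    }

    @Override
    Collection<String> collectBaseTerms(Collection<String> into) {
        Collection<String> out = (into != null) ? into : new ArrayList<>();
        for (Clause c : clauses) {
            c.query.collectBaseTerms(out);
        }
        return out;
    }

    // Simplification: a nested clause's own flags replace the outer flags.
    @Override
    void traverse(boolean required, boolean prohibited, QueryVisitor visitor) {
        for (Clause c : clauses) {
            c.query.traverse(c.required, c.prohibited, visitor);
        }
    }
}
```

With this shape the caller never inspects the query tree: it calls traverse() on the root and, for each visited query, collectBaseTerms(null), checking the concrete type only if it wants type-specific details such as slop.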

>That could be, for big data sets, and prefix/wildcard queries that have lots 
>of terms.
>Fortunately highlighting is only done for single documents at a time 

It is, but what does that have to do with a test? We can still run a test and see the difference
in collecting terms; that was my suggestion.
Maybe I didn't explain myself properly; English is my second language, btw.

>Another way around the problem is to start from highlighted document, and 
>build a (temporary) index, and actually execute query against just this 
>single dummy (RAMDirectory based) index (that contains only terms from that 
>one doc to be highlighted). It would be interesting to see if this might be 
>more efficient way to find actual matched terms.

Do you mean indexing just one document and using the search itself to highlight it? It could work,
especially with a pool of threads, but I believe it would cost too much in I/O, file handles, etc.
Or did you mean something else?
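
To make the single-document idea concrete, here is a rough sketch that stays entirely in memory and uses only JDK collections. It is not Lucene's RAMDirectory and does not execute a real query; it only illustrates expanding a prefix against the terms of one document, so the actual matched terms come from that document alone. The class name SingleDocTerms, the method matchPrefix, and the simple tokenization rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper: holds the sorted, lower-cased terms of ONE document.
final class SingleDocTerms {
    private final TreeSet<String> terms = new TreeSet<>();

    SingleDocTerms(String text) {
        // Naive tokenizer (assumption): split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    // Returns the terms of this document that start with the given prefix.
    // In sorted order all such terms are contiguous, starting at the first
    // term >= prefix, so the scan can stop at the first non-match.
    List<String> matchPrefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matched = new ArrayList<>();
        for (String term : terms.tailSet(p, true)) {
            if (!term.startsWith(p)) {
                break;
            }
            matched.add(term);
        }
        return matched;
    }
}
```

For example, new SingleDocTerms(text).matchPrefix("high") looks only at the terms of that one text, no matter how many terms the full index holds for the same prefix.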

>I agree, generic (actual) term access/collecting method should be available 
>from any Query (and actually same for base terms).
Take a look at the code and tell me what you think.

>Yes, I just happened to notice it in search package, didn't know such a thing 
>existed as query parser has (currently?) no way to use it. :-)
>Of course, having PhrasePrefixQuery, one wonders if it'd make sense to
>have PhraseWildcardQuery as well. :-)
>(don't think implementing that would be any more difficult than prefix one, 
>but both may be fairly inefficient in some cases)
Yes and no. Optimizing your solution only matters when we have a large amount of data.
Running a query like that by itself would be slow and not useful, because it would potentially
retrieve a lot of documents. But if there is another clause in the query it could be very useful:
the 2nd or 3rd clause will bring down the number of search results, and the wildcard phrase
query will make the difference, a nice one!

>Thanks for your ideas and suggestions,
Thanks to you too!
Attached is a zip file that contains my prototype of the collector. I know it can be optimized
and that it reflects my needs (see SlopeClause), but it is a good starting point. Also, the constructor
with the boolean to skip the term collector is not there because I always collect them; it
could be added easily.
Take a look and tell me what you think.
