lucene-dev mailing list archives

From "Commit Tag Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4628) Add common terms query to gracefully handle very high frequent terms dynamically
Date Fri, 14 Dec 2012 09:00:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532168#comment-13532168 ]

Commit Tag Bot commented on LUCENE-4628:
----------------------------------------

[trunk commit] Simon Willnauer
http://svn.apache.org/viewvc?view=revision&revision=1421743

LUCENE-4628: Added CommonTermsQuery

                
> Add common terms query to gracefully handle very high frequent terms dynamically
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-4628
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4628
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/other
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4628.patch, LUCENE-4628.patch
>
>
> Several times over the last couple of months I ran into the problem that searches very
> often contained extremely high-frequency terms, and disjunction queries became far too
> slow. The main problem was that stopword filtering wasn't really an option, since in that
> domain those high-frequency terms were not really stopwords. For instance, searching for
> a song title "this is it" or for a band "A" didn't really fly with stopwords. I thought
> about that for a while and came up with a query-based solution that decides, based on a
> threshold, whether a term is considered a stopword. It splits the terms into two boolean
> queries, one for high-frequency and one for low-frequency terms, such that the
> high-frequency terms are only matched if the low-frequency sub-query produces a match.
> If all terms are high-frequency, it turns the entire query into a conjunction, which gave
> me reasonable results as well as performance.
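
The splitting logic described above can be illustrated with a minimal, standalone Python sketch. It is not the committed CommonTermsQuery code and does not use Lucene. The toy index `DOCS`, the helper names `split_by_frequency` and `search`, the interpretation of the threshold as a fraction of all documents, and the "count of matching terms" score are all assumptions made for illustration.

```python
from collections import Counter

# Toy "index": document id -> set of terms (hypothetical data, not from the issue).
DOCS = {
    1: {"this", "is", "it"},
    2: {"this", "is", "a", "test"},
    3: {"is", "it", "a", "band"},
    4: {"michael", "jackson", "this", "is", "it"},
    5: {"a", "song"},
}


def split_by_frequency(terms, docs, max_term_frequency):
    """Partition query terms into (low_freq, high_freq) by document frequency.

    Assumption: the threshold is a fraction of the total number of documents.
    """
    doc_freq = Counter(t for doc_terms in docs.values() for t in doc_terms)
    cutoff = max_term_frequency * len(docs)
    low = [t for t in terms if doc_freq[t] <= cutoff]
    high = [t for t in terms if doc_freq[t] > cutoff]
    return low, high


def search(terms, docs, max_term_frequency=0.5):
    """Return (doc_id, score) pairs, best score first.

    The score is simply the number of query terms found in the document,
    a crude stand-in for real relevance scoring.
    """
    low, high = split_by_frequency(terms, docs, max_term_frequency)
    hits = []
    for doc_id, doc_terms in docs.items():
        if low:
            # High-frequency terms never select documents on their own;
            # a document qualifies only if a low-frequency term matches.
            matched = any(t in doc_terms for t in low)
        else:
            # Every term is high-frequency: require all of them (conjunction).
            matched = all(t in doc_terms for t in high)
        if matched:
            score = sum(t in doc_terms for t in low + high)
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(search(["this", "is", "it"], DOCS))             # [(1, 3), (4, 3)]
print(search(["jackson", "this", "is", "it"], DOCS))  # [(4, 4)]
print(search(["song", "is"], DOCS))                   # [(5, 1)]
```

In the last query, "is" occurs in four of the five documents, yet only document 5 is returned: the frequent term adds to the score of documents already selected by "song" but never selects documents by itself, which is what keeps the expensive posting list out of the main disjunction.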

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


