lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: constructing smaller phrase queries given a multi-word query
Date Fri, 20 Oct 2006 00:03:52 GMT

: eg. "rowling goblet of fire" - need to match "rowling" in 1 field &
: "goblet of fire" in another
: "hilary duff most wanted" - need to match "hilary duff" in 1 field &
: "most wanted" in another

: >  Why not just index those separate fields into yet a third field and
: > search there?
: >
: > Or why not just put it all into one field in the first place?
: I need the ability to boost matches in certain fields higher than others.
: So both the above approaches do not work for me.
: Not all fields have the same analyzer, so that's another reason for not
: using 1 catch-all field.

your "catchall" field can have yet-another-analyzer ... typically a more
simplistic analyzer than the field-specific ones. the important thing is
that the user input is parsed using the appropriate analyzer for each field.
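
The point about parsing the same user input once per field can be sketched in plain Java, without any Lucene classes. Everything here is hypothetical and only for illustration: the field names, the two toy "analyzers" and the tiny stop-word list are assumptions, not anything from the original setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class PerFieldAnalysis {
    // Hypothetical "simplistic" analyzer: lowercase, then split on whitespace.
    static List<String> simple(String text) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        if (t.isEmpty()) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(Arrays.asList(t.split("\\s+")));
    }

    // Hypothetical field-specific analyzer: same as simple(), minus a few stop words.
    static List<String> withStopWords(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(Arrays.asList("of", "the", "a"));
        return tokens;
    }

    // Runs the raw user input through each field's own analyzer.
    static Map<String, List<String>> analyzePerField(
            String input, Map<String, Function<String, List<String>>> analyzers) {
        Map<String, List<String>> tokensByField = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, Function<String, List<String>>> e : analyzers.entrySet()) {
            tokensByField.put(e.getKey(), e.getValue().apply(input));
        }
        return tokensByField;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> analyzers =
                new LinkedHashMap<String, Function<String, List<String>>>();
        analyzers.put("title", PerFieldAnalysis::withStopWords);
        analyzers.put("catchall", PerFieldAnalysis::simple);
        System.out.println(analyzePerField("Goblet of Fire", analyzers));
    }
}
```

The per-field token lists would then be turned into one sub-query per field, so a field with an aggressive analyzer and a catchall field with a simple one can both be searched from the same input.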

: I am trying out DisjunctionMaxQueries  and will soon move to it.
: I first want a phrase match to be done & if that fails then a non-phrase match.
: My problem is that I can't easily decide which phrases to build given a
: 4+word query from the user.

The initial approach you described to build all of the possible
permutations is the best approach i can think of - using DisMax to group
things in such a way that ensures that no one term/phrase common in
multiple fields dominates the score ... you'll also probably want to play
with BooleanQuery.setMinimumNumberShouldMatch and disabling coord ...
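
As a rough sketch of the "build all of the possible permutations" step, the plain-Java helper below enumerates every way to cut a query into contiguous, in-order chunks. It assumes that "permutations" means contiguous sub-phrases rather than reorderings, and the class and method names are made up for illustration. No Lucene classes are used; each chunk would presumably become a TermQuery or PhraseQuery per field, grouped with DisjunctionMaxQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseSegmenter {
    /** Returns every split of the query into contiguous, in-order chunks. */
    static List<List<String>> segmentations(String query) {
        List<List<String>> results = new ArrayList<List<String>>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return results;
        }
        String[] words = trimmed.split("\\s+");
        collect(words, 0, new ArrayList<String>(), results);
        return results;
    }

    // Depth-first: pick the next chunk words[start..end), then recurse on the rest.
    private static void collect(String[] words, int start,
                                List<String> current, List<List<String>> results) {
        if (start == words.length) {
            results.add(new ArrayList<String>(current));
            return;
        }
        for (int end = start + 1; end <= words.length; end++) {
            current.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            collect(words, end, current, results);
            current.remove(current.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        for (List<String> chunks : segmentations("rowling goblet of fire")) {
            System.out.println(chunks);
        }
    }
}
```

An n-word query yields 2^(n-1) segmentations (8 for four words), so this stays cheap for short queries but grows quickly; capping the number of words considered is one way to keep the generated query small.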

...personally i've never really tried tackling what you describe, because
queries are typically either simple enough that using it as a simple
phrase across multiple fields works, or it's complicated enough that you
aren't going to find good "sub phrase" matches in a general way.  relying
on matching the individual terms usually covers those cases well enough
(because you've got enough terms to get a meaningful score)

the one exception that i've been considering pursuing for one project is
to recognize *really* common phrases that should really be thought of as a
single term (eg: "digital camera") ... but in most cases they can probably be
dealt with in the analyzers using synonyms that collapse the multiple
tokens down to a single marker token (eg: digital_camera)
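
A minimal sketch of that collapsing idea, in plain Java rather than as a real Lucene token filter: it scans a token list and replaces any known two-word phrase with a single underscore-joined marker token. The class name, the restriction to two-word phrases and the phrase set are all assumptions for illustration; in practice this logic would live inside the analyzer and be applied at both index and query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseCollapser {
    /**
     * Replaces each known two-word phrase with one marker token,
     * e.g. "digital", "camera" becomes "digital_camera".
     */
    static List<String> collapse(List<String> tokens, Set<String> knownPhrases) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext && knownPhrases.contains(tokens.get(i) + " " + tokens.get(i + 1))) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
                i += 2; // both tokens were consumed by the marker
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<String>(Arrays.asList("digital camera"));
        List<String> tokens = Arrays.asList("cheap", "digital", "camera", "bag");
        System.out.println(collapse(tokens, phrases));
    }
}
```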

