nutch-dev mailing list archives

From "Steve Follmer" <sfoll...@meer.net>
Subject RE: A Lucene-FuzzyQuery Plugin
Date Thu, 17 Mar 2005 11:29:16 GMT


I've been experimenting with making these changes to BasicQueryFilter,
to add a bit of fuzz for intranet searches, where you might want to
match misspelled words (n.b. Google handles misspellings in a whole
separate way).
My approach is: don't do fuzzy on words of 4 letters or fewer, or on
phrases, or on anchors or URLs. It's useful to add the logging to
BasicQueryFilter and examine what's being sent to Lucene. I also give
fuzzy matches a lower boost.
I guess the right thing to do is to make a BasicFuzzyQueryFilter that
inherits from BasicQueryFilter and just adds 1 method and overrides 2
methods.

-Steve



public BooleanQuery filter(Query input, BooleanQuery output) {
    addTerms(input, output);
    addSloppyPhrases(input, output);
    LOG.info("Lucene gets: "+ output.toString("content"));
    return output;
}



private static void addTerms(Query input, BooleanQuery output) {

... 

    if (o.isPhrase()) {
        out.add(exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f]),
                false, false);
    } else {
        out.add(termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
                false, false);
        // A fuzzy query is really only appropriate for content, and it
        // is also a bit slow, so don't do fuzzy on anchors or URLs;
        // only do fuzzy for words of 5 characters or more.
        if ("content".equals(FIELDS[f])
                && o.getTerm().toString().length() >= 5) {
            out.add(fuzzyQuery(FIELDS[f], o.getTerm()), false, false);
        }
    }
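
To make the rule above easier to check on its own, here is a minimal
sketch that pulls the decision out into a plain helper with no Nutch or
Lucene dependency. The class name FuzzyEligibility, the method
shouldAddFuzzy and the constant MIN_FUZZY_LENGTH are hypothetical names
made up for illustration; they are not part of BasicQueryFilter.

```java
/** Hypothetical helper (not part of Nutch): decides whether a term gets a fuzzy clause. */
final class FuzzyEligibility {

    /** Shortest term that gets a fuzzy clause; the message uses 5. */
    static final int MIN_FUZZY_LENGTH = 5;

    /**
     * Mirrors the rule in the message: no fuzz on phrases, no fuzz outside
     * the "content" field (so none on anchors or URLs), and no fuzz on
     * words of 4 letters or fewer.
     */
    static boolean shouldAddFuzzy(String field, String term, boolean isPhrase) {
        if (isPhrase) {
            return false;
        }
        if (!"content".equals(field)) {
            return false;
        }
        return term != null && term.length() >= MIN_FUZZY_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(shouldAddFuzzy("content", "lucene", false)); // true
        System.out.println(shouldAddFuzzy("content", "nut", false));    // false: too short
        System.out.println(shouldAddFuzzy("anchor", "lucene", false));  // false: not content
        System.out.println(shouldAddFuzzy("content", "lucene", true));  // false: phrase
    }
}
```

Keeping the rule in one small method like this would also make it easy
to try a different minimum length without touching the query-building
code.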

...


  /** Utility to construct a Lucene fuzzy query. */
  private static org.apache.lucene.search.FuzzyQuery fuzzyQuery(
      String field, Term term) {
    // 0.8f is the minimum similarity: roughly, they have to get 80% of
    // the characters right; 2 is the prefix length that must match exactly
    FuzzyQuery result = new FuzzyQuery(luceneTerm(field, term), 0.8f, 2);
    // give fuzzy matches less boost in the overall search result ranking
    result.setBoost(0.5f);
    return result;
  }
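
To get a feel for what a 0.8 threshold with a 2-character prefix lets
through, here is a rough standalone sketch. It assumes similarity is
1 - (edit distance / length of the shorter word), which only
approximates what Lucene computes internally; the class
FuzzySimilaritySketch and its methods are hypothetical names for
illustration and are not Lucene API.

```java
/** Hypothetical sketch (not Lucene API): approximates fuzzy matching with a threshold. */
final class FuzzySimilaritySketch {

    /** Plain Levenshtein distance: insertions, deletions and substitutions cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed formula: 1 - distance / length of the shorter word. */
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(a, b) / shorter;
    }

    /** True if the first prefixLength characters agree and similarity reaches the threshold. */
    static boolean matches(String query, String candidate,
                           float minSimilarity, int prefixLength) {
        if (query.length() < prefixLength || candidate.length() < prefixLength) {
            return false;
        }
        if (!query.regionMatches(0, candidate, 0, prefixLength)) {
            return false;
        }
        return similarity(query, candidate) >= minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(matches("lucene", "lucine", 0.8f, 2)); // true: one edit in six letters
        System.out.println(matches("search", "serach", 0.8f, 2)); // false: a swap counts as two edits
        System.out.println(matches("lucene", "kucene", 0.8f, 2)); // false: prefix differs
    }
}
```

Under this assumed formula, a six-letter word tolerates a single wrong
letter, while two swapped letters already fall below 0.8. That is one
reason to experiment with the threshold against real intranet queries.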
  




