nutch-dev mailing list archives

From "Steve Follmer" <>
Subject RE: A Lucene-FuzzyQuery Plugin
Date Thu, 17 Mar 2005 11:29:16 GMT

I've been experimenting with making these changes to BasicQueryFilter,
to add a bit of fuzz for intranet searches, where you might want to
match mis-spelled words (n.b. Google does a whole separate thing about ...).
My approach is: don't do fuzzy on words of 4 letters or fewer, on phrases,
or on anchors or URLs. It's useful to add logging to BasicQueryFilter
and examine what's being sent to Lucene. I also give fuzzy matches a lower
boost. I guess the right thing to do is to make a BasicFuzzyQueryFilter that
inherits from BasicQueryFilter, adds 1 method, and overrides 2 methods.
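
The rule described above (no fuzz on short words, phrases, anchors, or URLs) can be sketched as a small standalone predicate. This is only an illustration: FuzzRule, shouldFuzz, and MIN_FUZZY_LENGTH are hypothetical names, not part of Nutch or Lucene, and the field name "content" is taken from the code in this message.

```java
// Hypothetical helper, not part of Nutch or Lucene: decides whether a
// fuzzy clause should be added for a given field and query term.
final class FuzzRule {
    // Words shorter than this are never fuzzed (the message uses 5).
    static final int MIN_FUZZY_LENGTH = 5;

    static boolean shouldFuzz(String field, String term, boolean isPhrase) {
        if (isPhrase) {
            return false;                 // phrases are matched exactly
        }
        if (!"content".equals(field)) {
            return false;                 // skip anchors, URLs, and any other field
        }
        return term.length() >= MIN_FUZZY_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(shouldFuzz("content", "lucene", false)); // true
        System.out.println(shouldFuzz("content", "word", false));   // false
        System.out.println(shouldFuzz("anchor", "lucene", false));  // false
        System.out.println(shouldFuzz("content", "lucene", true));  // false
    }
}
```

Keeping the rule in one place like this would make it easy to change the length threshold or the set of fuzzed fields without touching the query-building code.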


public BooleanQuery filter(Query input, BooleanQuery output) {
    addTerms(input, output);
    addSloppyPhrases(input, output);
    ...("Lucene gets: " + output.toString("content"));
    return output;
}

private static void addTerms(Query input, BooleanQuery output) {
    ...
    if (o.isPhrase()) {
        out.add(exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f]), false, false);
    } else {
        out.add(termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]), false, false);
        // A fuzzy query is really only appropriate for content,
        // and it's also a bit slow, so don't do fuzzy anchors or URLs;
        // only do fuzzy for words of 5 characters or more.
        if ("content".equals(FIELDS[f]) && (o.getTerm().toString().length() >= 5)) {
            out.add(fuzzyQuery(FIELDS[f], o.getTerm()), false, false);
        }
    }
    ...
}

  /** Utility to construct a Lucene fuzzy query. */
  private static FuzzyQuery fuzzyQuery(String field, Term term) {
    // 0.8f means they have to get 80% of the characters right
    FuzzyQuery result = new FuzzyQuery(luceneTerm(field, term), 0.8f, ...);
    // give fuzzy matches less boost in the overall search result
    result.setBoost(...);
    return result;
  }
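
To make the "80% of the characters right" reading of 0.8f concrete, here is a rough standalone sketch. It assumes the similarity is one minus the edit distance divided by the length of the shorter word, which is my understanding of how Lucene's classic fuzzy matching scores terms; treat that formula as an assumption and not as Lucene's actual code. FuzzySimilarity, editDistance, and similarity are hypothetical names.

```java
// Hypothetical illustration, not Lucene code: a Levenshtein-based
// similarity score under the assumption
//   similarity = 1 - editDistance / min(length of a, length of b).
final class FuzzySimilarity {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                          // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                          // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;                     // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed scoring formula; see the note above.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(a, b) / shorter;
    }
}
```

Under that assumption, "lucene" against "lucine" is one edit over six letters, about 0.83, which clears 0.8. A 4-letter word with one wrong letter ("word" against "ward") scores 0.75, which falls short of it, so skipping short words costs little at this threshold.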
