lucene-java-user mailing list archives

From Himanshu <g.himan...@gmail.com>
Subject using wildcard/regex query code
Date Wed, 18 May 2016 03:38:29 GMT
java-user@lucene.apache.org

Hi,

I'm trying to use code from lucene-core for the following use case in my
project.

Given a large sorted list of words (call it the dictionary) and a
wildcard/regex pattern, return the list of indices of the dictionary words
that match the pattern.

Here is one implementation for wildcard queries:

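  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.WildcardQuery;
  import org.apache.lucene.util.automaton.Automaton;
  import org.apache.lucene.util.automaton.CharacterRunAutomaton;

  // Returns the indices of all dictionary words that match the pattern.
  public List<Integer> match(String wildcardPattern, List<String> dictionary)
  {
    // The field name is a placeholder; only the pattern's automaton is used.
    WildcardQuery query = new WildcardQuery(new Term("dummy", wildcardPattern));
    Automaton automaton = query.getAutomaton();
    CharacterRunAutomaton runner = new CharacterRunAutomaton(automaton);

    // Linear scan: run the automaton over every word in the dictionary.
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < dictionary.size(); i++) {
      if (runner.run(dictionary.get(i))) {
        result.add(i);
      }
    }

    return result;
  }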


The implementation above works, but it scans every word and does not
exploit the fact that the dictionary is sorted. I suspect other code in
lucene-core does exploit this. My guess is based on this note in the
WildcardQuery javadoc (the RegexQuery javadoc has a similar comment):

"Note this query can be slow, as it needs to iterate over many terms. In
order to prevent extremely slow WildcardQueries, a Wildcard term should not
start with the wildcard *"


For example, if I knew every literal prefix that the wildcard pattern can
match, I could prune the dictionary by searching only the words that start
with those prefixes (the range for each prefix could be located via binary
search).
Can someone give me pointers, or show me where in the Lucene code something
similar is done?
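
A minimal sketch of the prefix-pruning idea above, for illustration only. It
is not Lucene code: the class name PrefixPrunedWildcard and the helpers
literalPrefix, lowerBound and matches are hypothetical, and a small
hand-written wildcard matcher stands in for CharacterRunAutomaton so that the
example needs nothing beyond the JDK. It assumes the pattern uses only '*'
and '?' with no backslash escapes, and that the dictionary is sorted in
String.compareTo order.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixPrunedWildcard {

  // Literal characters before the first wildcard ('*' or '?').
  // Assumption: the pattern contains no backslash escapes.
  static String literalPrefix(String pattern) {
    int i = 0;
    while (i < pattern.length()
        && pattern.charAt(i) != '*' && pattern.charAt(i) != '?') {
      i++;
    }
    return pattern.substring(0, i);
  }

  // Binary search: index of the first element >= key in a sorted list.
  static int lowerBound(List<String> sorted, String key) {
    int lo = 0, hi = sorted.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted.get(mid).compareTo(key) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Stand-in matcher: '*' matches any run of chars, '?' exactly one char.
  static boolean matches(String pattern, String word) {
    int p = 0, w = 0, star = -1, mark = 0;
    while (w < word.length()) {
      if (p < pattern.length() && pattern.charAt(p) == '*') {
        star = p++;          // remember the star and try matching zero chars
        mark = w;
      } else if (p < pattern.length()
          && (pattern.charAt(p) == '?' || pattern.charAt(p) == word.charAt(w))) {
        p++;
        w++;
      } else if (star >= 0) {
        p = star + 1;        // backtrack: let the last star absorb one more char
        w = ++mark;
      } else {
        return false;
      }
    }
    while (p < pattern.length() && pattern.charAt(p) == '*') {
      p++;                   // trailing stars match the empty string
    }
    return p == pattern.length();
  }

  // Indices of matching words, scanning only the range that shares the prefix.
  static List<Integer> match(String pattern, List<String> sortedDictionary) {
    String prefix = literalPrefix(pattern);
    List<Integer> result = new ArrayList<>();
    for (int i = lowerBound(sortedDictionary, prefix);
        i < sortedDictionary.size(); i++) {
      String word = sortedDictionary.get(i);
      if (!word.startsWith(prefix)) {
        break;               // sorted order: no later word has the prefix
      }
      if (matches(pattern, word)) {
        result.add(i);
      }
    }
    return result;
  }
}
```

With the sorted dictionary [apple, apply, apricot, banana, band, bandit],
match("ban*", dictionary) visits only indices 3 to 5 and returns [3, 4, 5].
A pattern that starts with '*' or '?' has an empty literal prefix, so the
loop degrades to the full scan, which is consistent with the javadoc warning
quoted above.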



Thanks.
