lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: How to get the terms matching a WildCardQuery in Lucene 6.2?
Date Tue, 25 Oct 2016 19:30:28 GMT
A WildcardQuery is a subclass of MultiTermQuery.  If you are using the QueryParser, you need to
set the rewrite method on the parser.

Try this, and beware of hitting the maximum BooleanQuery clause limit; if you do hit it, raise it with:



BooleanQuery.setMaxClauseCount(numberBigEnoughForYourNeeds);



import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RewriteTest {

    /** Indexes 100 sample terms, then prints the terms that a wildcard query expands to. */
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String field = "contents";
        Directory directory = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, config);
        for (int i = 0; i < 100; i++) {
            Document d = new Document();
            d.add(new TextField(field, "aard00"+i, Field.Store.YES));
            indexWriter.addDocument(d);
        }
        indexWriter.flush();
        indexWriter.close();

        String queryString = "aard????";

        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);

        QueryParser parser = new QueryParser(field, analyzer);
        // Make the parser rewrite wildcard queries into one boolean clause per matching term
        parser.setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);
        Query q = parser.parse(queryString);
        // Expand the wildcard against the terms actually present in the index
        q = q.rewrite(reader);
        Set<Term> terms = new HashSet<>();
        // The second argument is needsScores; scores are not needed to extract terms
        Weight weight = q.createWeight(searcher, false);
        weight.extractTerms(terms);
        for (Term t : terms) {
            System.out.println(t);
        }
        reader.close();
    }

}
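
A note on the clause-limit warning above: the boolean rewrite turns the wildcard into one clause per matching term, so a broad pattern over a large index can exceed the limit (1024 by default in Lucene 6.x, where exceeding it throws BooleanQuery.TooManyClauses). The sketch below is a toy model in plain Java, not Lucene code. The class ClauseLimitSketch, its methods, and the use of IllegalStateException are all made up for illustration, and the matcher handles only the `?` wildcard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model (not Lucene code) of expanding a wildcard into one clause per matching term. */
public class ClauseLimitSketch {

    /** True if term matches a pattern in which '?' stands for exactly one character. */
    public static boolean matches(String pattern, String term) {
        if (pattern.length() != term.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '?' && p != term.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Collects matching terms; fails as soon as the (hypothetical) clause limit is exceeded. */
    public static List<String> expand(TreeSet<String> dictionary, String pattern, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : dictionary) {
            if (matches(pattern, term)) {
                if (clauses.size() == maxClauses) {
                    throw new IllegalStateException("more than " + maxClauses + " clauses");
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Same sample terms as the test above: aard000 ... aard0099
        TreeSet<String> dictionary = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            dictionary.add("aard00" + i);
        }
        // Only the 8-character terms aard0010 ... aard0099 match "aard????"
        System.out.println(expand(dictionary, "aard????", 1024).size()); // prints 90
        try {
            expand(dictionary, "aard????", 10);
        } catch (IllegalStateException e) {
            System.out.println("limit hit: " + e.getMessage());
        }
    }
}
```

With a real word list, a pattern such as aa* is the kind of query where the number of matching terms, and therefore clauses, is worth checking before picking a limit.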


From: Evert Wagenaar [mailto:evert.wagenaar@gmail.com]
Sent: Tuesday, October 25, 2016 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: How to get the terms matching a WildCardQuery in Lucene 6.2?

Hi Allison,

Unfortunately I can't compile the code (see below). Can you tell me what's wrong?
I tried both MultiTermQuery.SCORING_BOOLEAN_REWRITE and CONSTANT_SCORE_BOOLEAN_REWRITE.

What I actually don't understand is how MultiTermQuery relates to my Query, which is a wildcard
query and not a MultiTermQuery.

Can you explain?

Thanks,

Evert Wagenaar


[Inline image 1]

Full code of Searcher:


package tk.evertwagenaar.lucene;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.FSDirectory;



/** Simple command-line based search demo. */
public class SearchFiles {

    private static IndexReader reader;
    private static Query q;

    private SearchFiles() {
    }

    /** Simple command-line based search demo. */
    public static void main(String[] args) throws Exception {
        String usage = "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }

        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = "aard????";
        int hitsPerPage = 10;

        reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();

        BufferedReader in = null;

        QueryParser parser = new QueryParser(field, analyzer);
        while (true) {
            if (queries == null && queryString == null) { // prompt the user
                System.out.println("Enter query: ");
            }

            Query q = parser.parse(queryString);
            System.out.println("Searching for: " + q.toString(field));

            if (repeat > 0) { // repeat & time as benchmark
                Date start = new Date();
                for (int i = 0; i < repeat; i++) {
                    searcher.search(q, 100);
                }
                Date end = new Date();
                System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
                doPagingSearch(in, searcher, q, hitsPerPage, raw, queries == null && queryString == null);

            MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE

                q = q.rewrite(reader);
                Set<Term> terms = new HashSet<>();
                Weight weight = q.createWeight(searcher, false);
                terms = weight.extractTerms(terms);

                System.out.println("Match: " + terms);
                reader.close();

            }
        }
    }



    /**
     * Search the Query against the Index
     */
    public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query, int hitsPerPage,
            boolean raw, boolean interactive) throws IOException {

        // Collect enough docs to show 5 pages
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;

        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");

        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);

        hits = searcher.search(query, numTotalHits).scoreDocs;
        end = Math.min(hits.length, start + hitsPerPage);

        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            System.out.println((i + 1) + ". " + path);
            query.rewrite(reader);
        }
    }

}
Evert Wagenaar

On Tue, Oct 25, 2016 at 1:58 AM, Evert Wagenaar <evert.wagenaar@gmail.com> wrote:
Thanks Allison. I will try it.


On Monday, 24 October 2016, Allison, Timothy B. <tallison@mitre.org> wrote:
Make sure to call setRewriteMethod on the MultiTermQuery with either
MultiTermQuery.SCORING_BOOLEAN_REWRITE or MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE.

Then something like this should work:

        q = q.rewrite(reader);

        Set<Term> terms = new HashSet<>();
        Weight weight = q.createWeight(searcher, false);

        weight.extractTerms(terms);



-----Original Message-----
From: Evert Wagenaar [mailto:evert.wagenaar@gmail.com]
Sent: Monday, October 24, 2016 12:41 PM
To: java-user@lucene.apache.org
Subject: How to get the terms matching a WildCardQuery in Lucene 6.2?

I already asked this on StackOverflow. Unfortunately without any answer for over a week now.

Therefore again to the real experts:


I downloaded a list of 350,000 English words in a .txt file and indexed it using the latest
Lucene (6.2). I want to apply wildcard queries like aard???? and then retrieve a list of matches.

I've done this before in an older version of Lucene. There it was pretty simple. I just had
to do a Query.rewrite() and this returned what I needed.
Unfortunately in 6.2 this doesn't work anymore. There is a Query.rewrite(IndexReader reader)
which should return a HashMap of Terms.
In my case there's only one matching Term (aardvark). The Searcher returns one hit, containing
the Document path to the wordlist. The HashMap is however empty.

When I change the Query to find more than one single match (like aa*) the HashMap remains
empty.

I tried the MatchExtractor too. Unfortunately without result.

The Objective of this is to demonstrate the power of Lucene to easily find words of a particular
length, given one or more characters. I'm pretty sure I can do this using regular expressions
in Java but then it's outside my objective.
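
For comparison, the regular-expression route mentioned here could look roughly like the sketch below. It uses only java.util.regex and no Lucene; the class WildcardRegexSketch and its methods are hypothetical names chosen for illustration. It assumes `?` means exactly one character and `*` means any run of characters, and it does not handle backslash escapes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java sketch: filter a word list with a wildcard pattern via java.util.regex. */
public class WildcardRegexSketch {

    /** Translates '?' to "." and '*' to ".*", quoting every other character literally. */
    public static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the words that match the wildcard as a whole, in their original order. */
    public static List<String> filter(List<String> words, String wildcard) {
        Pattern pattern = toPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("aardvark", "aardwolf", "aard", "aardvarks", "abacus");
        System.out.println(filter(words, "aard????")); // prints [aardvark, aardwolf]
    }
}
```

This scans every word on each query, whereas the Lucene approach walks the indexed term dictionary.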

Can anyone tell me why this isn't working? I use the StandardAnalyzer.
Should I use a different Application?

Any help is greatly appreciated.

Thanks.



--
Sent from Gmail IPad


