lucene-java-user mailing list archives

From Ivan Krišto <>
Subject Re: Questions about doing a full text search with numeric values
Date Wed, 03 Jul 2013 07:33:50 GMT
On 07/01/2013 12:22 PM, Erick Erickson wrote:
> WordDelimiterFilter(Factory if you're experimenting with
> Solr as Jack suggests) will fix a number of your cases since
> it splits on case change and numeric/alpha changes.
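A rough plain-Java sketch of the kind of splitting described above (breaking on letter/digit boundaries and on case changes). This is only an illustration, not the real WordDelimiterFilter, which has many more options:

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics two of WordDelimiterFilter's split rules
// (letter/digit boundary and lowercase/uppercase boundary).
public class DelimiterSketch {
    static List<String> split(String token) {
        String marked = token
            // break between a letter and a digit, in either order
            .replaceAll("(?<=\\p{Alpha})(?=\\p{Digit})", " ")
            .replaceAll("(?<=\\p{Digit})(?=\\p{Alpha})", " ")
            // break on a lowercase-to-uppercase case change
            .replaceAll("(?<=\\p{Lower})(?=\\p{Upper})", " ");
        return Arrays.asList(marked.split(" "));
    }

    public static void main(String[] args) {
        System.out.println(split("3tigers"));      // [3, tigers]
        System.out.println(split("PowerShot"));    // [Power, Shot]
        // Note: a pure run of digits stays one token, so the
        // 000000123456 example below is not helped by this split.
        System.out.println(split("000000123456")); // [000000123456]
    }
}
```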

If WordDelimiterFilter doesn't help, maybe you could take a look at
n-gram tokenizer (org.apache.lucene.analysis.ngram.NGramTokenizer).

It simply takes a word and splits it into n-character grams. For example:
Tokenizer ngramTok = new NGramTokenizer(reader, 3, 3);
will tokenize the word "tokenizer" as:
"tok", "oke", "ken", "eni", "niz", "ize", "zer"
(the constructor is NGramTokenizer(Reader input, int minGram, int maxGram)).
You could set minGram to 1 so it also covers corner cases such as finding
"3tigers" when searching for "3" (but WordDelimiterFilter should solve that one).
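For intuition, the gram generation itself is just a sliding window over the characters; a self-contained sketch (not the actual Lucene tokenizer, which streams from a Reader):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the grams NGramTokenizer emits for a single word.
public class NGramSketch {
    static List<String> ngrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // slide a window of length n across the word
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("tokenizer", 3, 3));
        // [tok, oke, ken, eni, niz, ize, zer]
    }
}
```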

This should give a more "Notepad-like find" ability, since it can match
parts of a word, but it will also introduce more noise into the search
results. It will also handle Erick Erickson's example:
> That won't deal with this example though: 000000123456.

    Ivan Krišto

> On Thu, Jun 27, 2013 at 1:47 PM, Jack Krupansky <>wrote:
>> Do continue to experiment with Solr as a "testbed" - all of the analysis
>> filters used by Solr are... part of Lucene, so once you figure things out
>> in Solr (using the Solr Admin UI analysis page), you can mechanically
>> translate to raw Lucene API calls.
>> Look at the standard tokenizer, it should do a better job with punctuation.
>> -- Jack Krupansky
>> -----Original Message----- From: Todd Hunt
>> Sent: Thursday, June 27, 2013 1:14 PM
>> To:
>> Subject: Questions about doing a full text search with numeric values
>> I am working on an application that is using Tika to index text based
>> documents and store the text results in Lucene.  These documents can range
>> anywhere from 1 page to thousands of pages.
>> We are currently using Lucene 3.0.3.  I am currently using the
StandardAnalyzer to index and search for the text that is contained in one
>> Lucene document field.
>> For strictly alpha based, English words, the searches return the results
>> as expected.  The problem has to do with searching for numeric values in
>> the indexed documents.  So examples of text in the documents that cannot be
>> found unless wild cards are used are:
>> - Ø
>>   - 800 does not find the text above
>> - $118.30
>>   - 118 does not find the text above
>> - 3tigers
>>   - 3 does not find the text above
>> - 000000123456
>>   - 123456 does not find the text above
>> - 123,abc,foo,bar,456
>>   - This is in a CSV file
>>   - Neither 123 nor 456 finds the text above
>>     - I realize it has to do with the text only being separated by
>>       commas, so it is treated as one token, but I think the issue is
>>       no different from the others
>> The expectation from our users is that if they can open the document in
>> its default application (Word, Adobe, Notepad, etc.) and perform a "find"
>> within that application and find the text, then our application based on
>> Lucene should be able to find the same text.
>> It is not reasonable for us to request that our users surround their
>> search with wildcards.  Also, it seems like a kludge to programmatically
>> put wild cards around any numeric values the user may enter for searching.
>> Is there some type of numeric parser or filter that would help me out with
>> these scenarios?
>> I've looked at Solr, and we already have a strong foundation of code
>> utilizing Spring, Hibernate, and Lucene.  Trying to integrate Solr into our
>> application would take too much refactoring and time that isn't available
>> for this release.
>> Also, since these numeric values are embedded within the documents, I
>> don't think storing them as their own field would make sense since I want
>> to maintain the context of the numeric values within the document.
>> Thank you.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
