lucene-java-user mailing list archives

From Ivan Krišto <ivan.kri...@gmail.com>
Subject Re: Questions about doing a full text search with numeric values
Date Wed, 03 Jul 2013 07:33:50 GMT
On 07/01/2013 12:22 PM, Erick Erickson wrote:
> WordDelimiterFilter(Factory if you're experimenting with
> Solr as Jack suggests) will fix a number of your cases since
> it splits on case change and numeric/alpha changes.
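To make that splitting behavior concrete, here is a plain-Java sketch of the kind of token splitting WordDelimiterFilter performs (this is an illustrative approximation, not the Lucene class: it breaks on punctuation, alpha/digit boundaries, and lower-to-upper case changes):

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplitSketch {
    // Split a token on punctuation, alpha/digit boundaries, and case changes,
    // roughly as WordDelimiterFilter does with its default options.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) { // punctuation: hard break
                flush(parts, cur);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean alphaDigit = Character.isDigit(prev) != Character.isDigit(c);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                if (alphaDigit || caseChange) flush(parts, cur);
            }
            cur.append(c);
        }
        flush(parts, cur);
        return parts;
    }

    static void flush(List<String> parts, StringBuilder cur) {
        if (cur.length() > 0) { parts.add(cur.toString()); cur.setLength(0); }
    }

    public static void main(String[] args) {
        System.out.println(split("1-800-costumes.com")); // [1, 800, costumes, com]
        System.out.println(split("3tigers"));            // [3, tigers]
        System.out.println(split("$118.30"));            // [118, 30]
    }
}
```

With splits like these indexed as separate tokens, a query for "800" or "3" matches, which covers most of the examples below.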

If WordDelimiterFilter doesn't help, maybe you could take a look at
n-gram tokenizer (org.apache.lucene.analysis.ngram.NGramTokenizer).

It simply takes each word and splits it into a stream of n-character grams.
For example:
Tokenizer ngramTok = new NGramTokenizer(reader, 3, 3);
will tokenize word "tokenizer" as:
"tok", "oke", "ken", "eni", "niz", "ize", zer"
(constructor is: NGramTokenizer(Reader input, int minGram, int maxGram))
You could set minGram to 1 so it also covers corner cases such as "3tigers"
when searching for "3" (though WordDelimiterFilter should solve that one).
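To see what those gram streams look like, here is a self-contained plain-Java sketch of character n-gram generation (an approximation of NGramTokenizer's output, not the Lucene tokenizer itself):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Emit all character n-grams of the input, for lengths minGram..maxGram,
    // grouped by gram size and ordered by position within each size.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // minGram=3, maxGram=3 reproduces the grams listed above:
        System.out.println(ngrams("tokenizer", 3, 3));
        // [tok, oke, ken, eni, niz, ize, zer]

        // With minGram=1, the single-character gram "3" is emitted for
        // "3tigers", so a one-character query can match:
        System.out.println(ngrams("3tigers", 1, 3).contains("3")); // true
    }
}
```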

This should provide a more "notepad-like find" ability, since it can match
parts of words, but it will also introduce more noise into the search
results.
Also, it will deal with Erick Erickson's example:
> That won't deal with this example though: 000000123456.
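The reason n-grams handle the leading-zeros case: if the query is analyzed with the same n-gram analysis as the indexed text, every gram of "123456" also occurs among the grams of "000000123456". A quick self-contained check of that claim (fixed 3-grams for simplicity; in practice a phrase/positional match would be used to reduce false hits):

```java
import java.util.ArrayList;
import java.util.List;

public class LeadingZerosSketch {
    // All consecutive 3-character grams of the input.
    static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> indexed = trigrams("000000123456"); // [000, 000, ..., 123, 234, 345, 456]
        List<String> query = trigrams("123456");         // [123, 234, 345, 456]
        System.out.println(indexed.containsAll(query));  // true
    }
}
```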

  Regards,
    Ivan Krišto

> On Thu, Jun 27, 2013 at 1:47 PM, Jack Krupansky <jack@basetechnology.com>wrote:
>
>> Do continue to experiment with Solr as a "testbed" - all of the analysis
>> filters used by Solr are... part of Lucene, so once you figure things out
>> in Solr (using the Solr Admin UI analysis page), you can mechanically
>> translate to raw Lucene API calls.
>>
>> Look at the standard tokenizer; it should do a better job with punctuation.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Todd Hunt
>> Sent: Thursday, June 27, 2013 1:14 PM
>> To: java-user@lucene.apache.org
>> Subject: Questions about doing a full text search with numeric values
>>
>>
>> I am working on an application that is using Tika to index text based
>> documents and store the text results in Lucene.  These documents can range
>> anywhere from 1 page to thousands of pages.
>>
>> We are currently using Lucene 3.0.3.  I am currently using the
>> StandardAnalyzer to index and search for the text that is contained in one
>> Lucene document field.
>>
>> For strictly alpha based, English words, the searches return the results
>> as expected.  The problem has to do with searching for numeric values in
>> the indexed documents.  So examples of text in the documents that cannot be
>> found unless wild cards are used are:
>>
>> * 1-800-costumes.com
>>
>>   - "800" does not find the text above
>>
>> * $118.30
>>
>>   - "118" does not find the text above
>>
>> * 3tigers
>>
>>   - "3" does not find the text above
>>
>> * 000000123456
>>
>>   - "123456" does not find the text above
>>
>> * 123,abc,foo,bar,456
>>
>>   - This is in a CSV file
>>
>>   - Neither "123" nor "456" finds the text above
>>
>>   - I realize that it has to do with the text only being separated by
>>     commas, so it is treated as one token, but I think the issue is no
>>     different from the others
>>
>> The expectation from our users is that if they can open the document in
>> its default application (Word, Adobe, Notepad, etc.) and perform a "find"
>> within that application and find the text, then our application based on
>> Lucene should be able to find the same text.
>>
>> It is not reasonable for us to request that our users surround their
>> search with wildcards.  Also, it seems like a kludge to programmatically
>> put wild cards around any numeric values the user may enter for searching.
>>
>> Is there some type of numeric parser or filter that would help me out with
>> these scenarios?
>>
>> I've looked at Solr and we already have strong foundation of code
>> utilizing Spring, Hibernate, and Lucene.  Trying to integrate Solr into our
>> application would take too much refactoring and time that isn't available
>> for this release.
>>
>> Also, since these numeric values are embedded within the documents, I
>> don't think storing them as their own field would make sense since I want
>> to maintain the context of the numeric values within the document.
>>
>> Thank you.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

