lucene-java-user mailing list archives

Subject Re: Searching doubt
Date Tue, 04 Aug 2009 15:27:43 GMT
Just catching this thread, but if I understand what is being asked I can
share how I do multi-word phrase matching. If that's not what's wanted,

OK, I load an entire dictionary into a Lucene index, phrases and all.

When I'm scanning some text, I look up one word at a time in this
dictionary index, matching the word only _at the beginning_ of the indexed
field. This returns all words/phrases beginning with the word I searched for.

I then scan the rest of the input text and compare it to the longest
matching phrase in my lucene results. That then becomes a meaningful

Input text:
"The President of the United States lives in the White House"

"President of the United States"
"White House"

Term: "President"
"President of a Company"
"President of the United States"

Take the longest match.
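
The scan described above can be sketched roughly as follows. This is only an illustration, not the code from my application: the class name PhraseMatcher and its methods are hypothetical, and a plain in-memory map keyed by first word stands in for the Lucene "starts with this word" lookup. Tokenization is a simple whitespace split with case-insensitive comparison, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a map keyed by first word replaces the Lucene index lookup.
class PhraseMatcher {
    // first word (lowercased) -> dictionary entries starting with it, split into words
    private final Map<String, List<String[]>> byFirstWord = new HashMap<>();

    PhraseMatcher(List<String> dictionary) {
        for (String entry : dictionary) {
            String[] words = entry.trim().split("\\s+");
            byFirstWord.computeIfAbsent(key(words[0]), k -> new ArrayList<>()).add(words);
        }
    }

    private static String key(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    // Scans the text word by word and returns the longest dictionary match at each position.
    List<String> match(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // "Lookup": every entry that begins with the current word.
            List<String[]> candidates =
                byFirstWord.getOrDefault(key(tokens[i]), Collections.<String[]>emptyList());
            String[] best = null;
            for (String[] candidate : candidates) {
                boolean longer = best == null || candidate.length > best.length;
                if (longer && matchesAt(tokens, i, candidate)) {
                    best = candidate;
                }
            }
            if (best == null) {
                i++;                      // no entry starts here, move to the next word
            } else {
                found.add(String.join(" ", best));
                i += best.length;         // skip past the matched phrase
            }
        }
        return found;
    }

    // True if the phrase's words equal the input words starting at 'start' (ignoring case).
    private static boolean matchesAt(String[] tokens, int start, String[] phrase) {
        if (start + phrase.length > tokens.length) {
            return false;
        }
        for (int j = 0; j < phrase.length; j++) {
            if (!tokens[start + j].equalsIgnoreCase(phrase[j])) {
                return false;
            }
        }
        return true;
    }
}
```

With the three phrases from the example loaded as the dictionary, calling match() on the input sentence should pick "President of the United States" over "President of a Company", and then "White House". In a real setup the map lookup would be replaced by a query against the index, and punctuation handling would need more care than a whitespace split.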


> On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
>> 2) Use a dictionary (real dictionary), and search it for every
>> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a match,
>> split it there. This needs some fine tuning, like checking if the rest
>> is also a word and if the full string is also a word, so that you don't
>> break up meaningful words. You'll need to get a dictionary for that.
> I do not have a solution to this, but it strikes me as very similar to
> the way you traverse Japanese to break words, since that has no
> spaces. Is there a Japanese tokenizer and, if so, does it handle this?
> If so, you could replace the Japanese dictionary with an English
> dictionary. Just a random thought I had that might / might not help.
> Phil
