lucene-java-user mailing list archives

From Magnus Johansson
Subject Re: QueryParser and compound words
Date Wed, 12 Mar 2003 08:19:23 GMT
Well, the problem arises when a user enters a query with a compound word
and the compound word itself is not indexed, only one of its parts.

For example the index contains a document with the following word:
fotboll (football).

Let's say the user searches for fotbollsmatch (football game). The word
is split into fotboll and match, and the phrase "fotboll match" is
searched for. Since the document contains only fotboll, the phrase does
not match and the user finds no matching document.
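
To make the failure concrete, here is a minimal sketch in Python. It is a
toy model and not Lucene code: the COMPOUND_PARTS table and the analyze and
phrase_match helpers are hypothetical stand-ins for the Analyzer and for
phrase matching.

```python
# Toy model, not Lucene. The splitting table and helper names are
# hypothetical and exist only for this illustration.
COMPOUND_PARTS = {"fotbollsmatch": ["fotboll", "match"]}


def analyze(text):
    """Lowercase, split on whitespace, and expand known compounds into parts."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(COMPOUND_PARTS.get(word, [word]))
    return tokens


def phrase_match(query_tokens, doc_tokens):
    """Return True if query_tokens occur consecutively in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = analyze("fotboll")            # the indexed document
query = analyze("fotbollsmatch")    # the user's query, split by the analyzer
phrase_hit = phrase_match(query, doc)
```

Here query holds both parts, fotboll and match, while doc holds only fotboll,
so phrase_hit is False. A document containing fotbollsmatch itself would
still match, because it is split the same way at indexing time.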

Compare this to English: the user would have found the document, although
ranked slightly lower than a document containing both the words football
and game.
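
The English case can be sketched the same way. This is a toy model in
Python and not Lucene code: it assumes the two query words are combined
with OR, and the or_score helper is hypothetical. It only counts
overlapping terms, which is far simpler than real Lucene scoring.

```python
# Toy model, not Lucene. Assumes the query terms are OR-ed together and
# that a document's score is simply the number of query terms it contains.
def or_score(query_terms, doc_terms):
    """Count how many distinct query terms appear in the document."""
    return len(set(query_terms) & set(doc_terms))


docs = {
    "a": ["football"],            # contains only one of the words
    "b": ["football", "game"],    # contains both words
}
query = ["football", "game"]

scores = {name: or_score(query, terms) for name, terms in docs.items()}
# Keep every document that matches at least one term, best score first.
ranked = sorted((name for name in scores if scores[name] > 0),
                key=lambda name: -scores[name])
```

Both documents end up in ranked, with "b" ahead of "a". The document with
only football is still found, which is the difference from the Swedish
phrase search.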

I agree with you that this might not be a problem. The user could be
expected to reformulate his query. However, the behaviour for an English
index and a Swedish index would be different.


Tatu Saloranta wrote:

>On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
>>I have written an Analyzer for Swedish. Compound words are common in
>>Swedish, therefore my Analyzer tries to split compound words
>>into their parts. For example the Swedish word fotbollsmatch (football
>>game) is split into fotboll and match.
>(The same applies to many other languages, so this is a common problem, I think.)
>However... I'm not sure why you consider this a problem? The reason quotes
>are added is that since a single token (as parsed by QueryParser) expands to
>multiple terms, it becomes a PhraseQuery. The same happens (or should happen)
>during indexing, so the end result should match the word both in the "normal"
>case (word is correctly spelled as a compound word) and when the word is
>(incorrectly) spelled with spaces?
>As to quotes; they are only shown when converting query to a String; 
>internally there are no quotes to be matched.
>-+ Tatu +-