lucene-java-user mailing list archives

From Bernhard Messer <>
Subject Re: English and French documents together / analysis, indexing, searching
Date Thu, 20 Jan 2005 19:15:51 GMT

>> you could try to create a more complex query and expand it into both 
>> languages using different analyzers. Would this solve your problem ?
> Would that mean I would have to actually conduct two searches (one in 
> English and one in French) then merge the results and display them to 
> the user?
> It sounds to me like a long way around, so then actually writing an 
> analyzer that has the language guesser might be a better solution on 
> the long run?

It's no problem to guess the language based on the document corpus. But 
how do you want to guess the language of a simple Term Query? What if 
your users search for names like "George Bush"? You can't guess the 
language of such a query, so you have to expand it into both languages. 
I don't see an easier way of solving that problem.
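A minimal sketch of the expansion idea, assuming a hypothetical index layout with per-language fields named "contents_en" and "contents_fr" (those field names are not from the thread, just illustrative):

```java
// Sketch: expand a single user term into a query string that ORs the term
// across hypothetical per-language fields, so both the English-analyzed and
// the French-analyzed content are searched at once.
public class BilingualQueryExpander {

    // Builds a Lucene-style query string covering both language fields.
    static String expand(String term) {
        return "contents_en:(" + term + ") OR contents_fr:(" + term + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("\"George Bush\""));
        // contents_en:("George Bush") OR contents_fr:("George Bush")
    }
}
```

In a real application you would parse the expanded string once with a QueryParser per field (each configured with the matching analyzer) instead of concatenating strings, but the shape of the final boolean query is the same.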

>> This behaviour is implemented in StandardTokenizer, which is used by 
>> StandardAnalyzer. Look at the documentation of StandardTokenizer:
>> Many applications have specific tokenizer needs.  If this tokenizer 
>> does not suit your application, please consider copying this source code
>> directory to your project and maintaining your own grammar-based 
>> tokenizer.
> Hmm, I feel writing my own tokenizer is beyond my abilities at the 
> moment, without more in-depth knowledge of everything else.
> Perhaps I'll try taking the StandardTokenizer and expanding or changing 
> it based on other tokenizers available in Lucene, such as 
> WhitespaceTokenizer.

What about using the WhitespaceAnalyzer directly? Maybe this fits your 
requirement better, and you could use it for both languages.
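To illustrate why a whitespace-based analyzer is language-neutral, here is a rough stand-alone sketch of what Lucene's WhitespaceTokenizer effectively does (this is an approximation using the standard library, not the Lucene class itself): it splits on whitespace only and leaves accents and apostrophes untouched, so English and French text are treated identically.

```java
import java.util.Arrays;
import java.util.List;

// Approximation of WhitespaceTokenizer behaviour: split on runs of
// whitespace, performing no lowercasing, stemming, or punctuation handling.
public class WhitespaceTokenDemo {

    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("l'été de George Bush"));
        // [l'été, de, George, Bush]
    }
}
```

The trade-off is that no language-specific normalization happens at all, so "été" and "Été" would remain distinct terms unless you add a lowercasing filter.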

