lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Is StandardAnalyzer good enough for multi languages...
Date Tue, 08 Jan 2013 23:57:25 GMT
Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you,
along with the token filters in that same module. - Steve
 
On Jan 8, 2013, at 6:43 PM, Trejkaz <trejkaz@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantoshi76@gmail.com> wrote:
>> Does Lucene's StandardAnalyzer work for all languages when tokenizing before
>> indexing (since we are using Java, I think the content is converted to UTF-8
>> before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something
> which it should break. Some of these cases even result in undesirable
> behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.
> 
> It does follow "Unicode standards" for how to tokenise text, but these
> standards were written by people who didn't quite know what they were
> doing so it's really just passing the buck. I don't think Lucene
> should have chosen to follow that standard in the first place, because
> it rarely (if ever) gives acceptable results.
> 
> The worst examples for English, at least for us, were that it does not
> break on colon (:) or underscore (_).
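A minimal sketch of the kind of post-processing described above: take a token that the tokenizer left intact and break it on colon and underscore. This is plain Java with no Lucene dependency; in a real analysis chain the same logic would presumably live in a token filter placed after StandardTokenizer. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class DelimiterSplitter {
    // Splits one token on runs of ':' or '_' and drops empty pieces.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        for (String piece : token.split("[:_]+")) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
        return parts;
    }
}
```

For example, `DelimiterSplitter.split("foo_bar:baz")` yields the three pieces `foo`, `bar` and `baz`.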
> 
> Colon was explained by some languages using it like an apostrophe.
> Personally I think you should break on an apostrophe as well, so I'm
> not really happy with this reasoning, but OK.
> 
> Underscore was completely baffling to me so I asked someone at Unicode
> about it. They explained that it was because it was "used by
> programmers to separate words in identifiers". This explanation is
> exactly as stupid as it sounds and I hope they will realise their
> stupidity some day.
> 
>> or do we need to use special analyzers for each language?
> 
> I do think that StandardTokenizer at least can form a good base for an
> analyser. You just have to add a ton of filters to fix each additional
> case you find where people don't like it. For instance, it returns
> runs of Katakana as a single token, but if you did that, people
> wouldn't find what they are searching for, so you make a filter to
> split that back out into multiple tokens.
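The message does not say how the Katakana run is split back out. One possible approach, shown here purely as an assumption, is to emit overlapping two-character pieces (bigrams) for any token made up entirely of Katakana. The sketch is plain Java; `KatakanaSplitter` is a hypothetical name, and a production version would presumably be a Lucene token filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene. Bigram splitting is an assumption.
final class KatakanaSplitter {
    // True if every code point is in the Katakana block (U+30A0..U+30FF).
    static boolean isKatakana(int[] codePoints) {
        for (int cp : codePoints) {
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    // Returns overlapping bigrams for an all-Katakana token of length >= 2;
    // any other token is returned unchanged as a single-element list.
    static List<String> split(String token) {
        int[] cps = token.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length < 2 || !isKatakana(cps)) {
            out.add(token);
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }
}
```

With this sketch, `KatakanaSplitter.split("カタカナ")` gives `カタ`, `タカ`, `カナ`, while a Latin token such as `hello` passes through untouched.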
> 
> It would help if there were a single, core-maintained analyser for
> "StandardAnalyzer with all the things people hate fixed"... but I
> don't know if anyone is interested in maintaining it.
> 
>> In this case, if a document has mixed content (English +
>> Japanese), which analyzer should we use, and how can we figure that out
>> dynamically before indexing?
> 
> Some language detection libraries will give you back the fragments in
> the text and tell you which language is used for each fragment, so
> that is totally a viable option as well. You'd just make your own
> analyser which concatenates the results.
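As a rough stand-in for a language detection library, the fragment idea can be sketched with the JDK's `Character.UnicodeScript`: cut the text into runs that share a script group, then hand each run to whichever analyser suits it. This detects scripts, not languages, so it is only an approximation. `ScriptSegmenter`, `Fragment` and the `"CJK"` label are hypothetical names, and grouping Han, Hiragana and Katakana together is an assumption made so that Japanese text stays in one run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text into runs by script group.
final class ScriptSegmenter {
    static final class Fragment {
        final String label; // script group, or null if the text had no letters
        final String text;
        Fragment(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    // Maps a code point to a coarse label; null means "neutral"
    // (spaces, punctuation, digits), which stays with the current run.
    static String labelOf(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        switch (s) {
            case HAN: case HIRAGANA: case KATAKANA: return "CJK";
            case COMMON: case INHERITED: case UNKNOWN: return null;
            default: return s.name();
        }
    }

    static List<Fragment> segment(String text) {
        List<Fragment> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        String current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String label = labelOf(cp);
            if (label != null) {
                // A change of script group closes the run collected so far.
                if (current != null && !label.equals(current)) {
                    out.add(new Fragment(current, buf.toString()));
                    buf.setLength(0);
                }
                current = label;
            }
            buf.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (buf.length() > 0) {
            out.add(new Fragment(current, buf.toString()));
        }
        return out;
    }
}
```

Calling `ScriptSegmenter.segment("Lucene 検索エンジン test")` produces three fragments labelled `LATIN`, `CJK` and `LATIN`; each could then be passed to a language-appropriate analyser and the resulting tokens concatenated.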
> 
>> Also, when searching, if the query text contains both English and
>> Japanese, how does this work? Any criteria for choosing the analyzers?
> 
> I guess you could either ask the user what language they're searching
> in or look at what characters are in their query and decide which
> language(s) it matches and build the query from there. It might match
> multiple...
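The "look at what characters are in the query" option could be sketched as follows: collect the set of Unicode scripts that occur in the query string and pick analysers from that. This is an illustration using only the JDK; `QueryScripts` is a hypothetical name. Note that a script does not pin down a language (Han characters alone could be Chinese or Japanese), which is one way a query can match several languages.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: reports which scripts appear in a query string.
final class QueryScripts {
    static Set<Character.UnicodeScript> scriptsIn(String query) {
        Set<Character.UnicodeScript> found =
                EnumSet.noneOf(Character.UnicodeScript.class);
        query.codePoints().forEach(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            // Skip script-neutral characters such as spaces and punctuation.
            if (s != Character.UnicodeScript.COMMON
                    && s != Character.UnicodeScript.INHERITED
                    && s != Character.UnicodeScript.UNKNOWN) {
                found.add(s);
            }
        });
        return found;
    }
}
```

A query such as `lucene 検索` yields the set containing `LATIN` and `HAN`, so the application would build the query with both an English and a CJK-capable analyser.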
> 
> TX
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 