lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Mon, 10 May 2010 16:52:18 GMT


Robert Muir commented on LUCENE-2167:

I assume you don't mean to say that English and European languages are not real languages

I think the heuristics I am talking about that are in StandardTokenizer today, which don't
really even work*, shouldn't have a negative effect on other languages, that's all.

I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable so-called
StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.
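
To make the mechanism concrete, here is a minimal sketch of type-driven possessive stripping. It assumes the tokenizer has already labelled each token with a type; the class name, method name, and type strings are hypothetical and are not Lucene's actual API.

```java
class PossessiveStripSketch {
    // Type labels are illustrative strings, not constants taken from Lucene.
    static final String APOSTROPHE = "<APOSTROPHE>";
    static final String ALPHANUM = "<ALPHANUM>";

    // Strip a trailing 's (or 'S) only from tokens typed as APOSTROPHE.
    static String stripPossessive(String text, String type) {
        boolean possessive = text.endsWith("'s") || text.endsWith("'S");
        if (APOSTROPHE.equals(type) && possessive) {
            return text.substring(0, text.length() - 2);
        }
        return text;
    }

    public static void main(String[] args) {
        // prints O'Reilly
        System.out.println(stripPossessive("O'Reilly's", APOSTROPHE));
        // prints O'Reilly's (the type does not match, so nothing is stripped)
        System.out.println(stripPossessive("O'Reilly's", ALPHANUM));
    }
}
```

The rule only makes sense for English possessives, which is why tying it to the default tokenizer's token types is awkward for every other language.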

It might be confusing, though, for a (e.g.) Greek user to have to go look at the analysis.en
package to get reasonable performance for her language.

fyi, GreekAnalyzer didn't even use this stuff until 3.1 (it omitted StandardFilter).

But I don't think it matters where we put the "western" tokenizer, as long as it's not StandardTokenizer.
Honestly, I don't really care much about what it does. I don't consider it very important
or very accurate, only the source of many JIRA bugs*, hassle, and confusion (invalidAcronym etc).

It just seems to be more trouble than it's worth.

* LUCENE-1438
* LUCENE-2244
* LUCENE-1787
* LUCENE-1403
* LUCENE-1100
* LUCENE-1556
* LUCENE-571
* LUCENE-1068
* I stopped at this point; I think these are enough examples.

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>                 Key: LUCENE-2167
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
> as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe handling can stay
> with that EuropeanTokenizer, and it could be used by the European analyzers.
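
For a rough feel of word breaking that relies on general boundary rules rather than English-specific heuristics, here is a minimal sketch using the JDK's java.text.BreakIterator. This is a stand-in only: it is not the jflex grammar the issue proposes, and it is not claimed here to conform to UAX#29. The names WordBreakSketch and words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Split text at the JDK's word boundaries and keep only the segments
    // that contain at least one letter or digit (drops spaces, punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("This should be a good tokenizer for most languages."));
    }
}
```

Nothing in this sketch assigns acronym, company, or apostrophe types; that kind of handling would live in the separate "European" tokenizer described above.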

