lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Fri, 14 May 2010 22:48:43 GMT


Robert Muir commented on LUCENE-2167:

Hmm, I ran some tests, and I think I see your problem.

I tried this:
  public void testThai() throws Exception {
    assertAnalyzesTo(a, "ภาษาไทย", new String[] { "ภาษาไทย" });
  }

The reason you get something different from the Unicode site is that recently? these have
Instead, anything that needs a dictionary or whatever is identified by [:Line_Break=Complex_Context:].
You can see this mentioned in the standard:

In particular, the characters with the Line_Break property values of Contingent_Break (CB),
Complex_Context (SA/South East Asian), and XX (Unknown) are assigned word boundary property
values based on criteria outside of the scope of this annex.

In ICU, I noticed the default rules do this:
$dictionary   = [:LineBreak = Complex_Context:];
$dictionary $dictionary

(so they just stick together with this chained rule)
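
To illustrate what that chaining means in practice, here is a rough, self-contained Java sketch. It is not the ICU implementation and not the patch on this issue. The class ComplexContextChaining and its methods are hypothetical names, and since the JDK does not expose the Line_Break property, the sketch assumes a handful of South East Asian scripts (Thai, Lao, Khmer, Myanmar) as a stand-in for [:Line_Break=Complex_Context:]. Adjacent characters from that set are simply glued into one token, with no dictionary lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class ComplexContextChaining {

    // Assumption: approximate [:Line_Break=Complex_Context:] by script,
    // because the JDK has no accessor for the Line_Break property.
    static boolean isComplexContext(int cp) {
        switch (Character.UnicodeScript.of(cp)) {
            case THAI:
            case LAO:
            case KHMER:
            case MYANMAR:
                return true;
            default:
                return false;
        }
    }

    // Emits maximal runs of complex-context characters as single tokens
    // (the "$dictionary $dictionary" chaining idea). Other letters and
    // digits form their own runs; everything else acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean runIsComplex = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean complex = isComplexContext(cp);
            boolean wordChar = complex || Character.isLetterOrDigit(cp);
            // Close the current run at a separator or when the kind of run changes.
            if (run.length() > 0 && (!wordChar || complex != runIsComplex)) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            if (wordChar) {
                run.appendCodePoint(cp);
                runIsComplex = complex;
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Thai phrase from the test above stays together as one token.
        System.out.println(tokenize("ภาษาไทย"));
    }
}
```

With this approximation the Thai phrase comes out as a single token, which matches the expectation in the test above. Real word segmentation for these scripts would still need a dictionary or similar.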

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>                 Key: LUCENE-2167
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
> as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay
> with that EuropeanTokenizer, and it could be used by the european analyzers.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
