lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Mon, 17 May 2010 12:15:46 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868194#action_12868194 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

bq. Yeah, we should regen all jflex files when patching this (ant jflex does this automatically,
so we don't need to care). Removing the hack from StandardTokenizer's jflex file should be done
in an issue, but it also does not hurt if the hack stays in the code.

Agreed.  I was thinking since Robert is moving StandardTokenizer that the regen could wait
until afterward.

bq. Checking the jflex version is hard to do. I'll think about it; maybe there is an Ant trick.
Is the version noted somewhere in a class file as a constant?

The release version is, I think, but we're using an unreleased version at the moment.  Hmm, for
the SVN checkout, maybe the .svn/entries file could be checked?  If we go that route (and I think
it's probably not a good idea), maybe we should instead be running "svn up" on the checkout?

bq. I think we should simply reopen LUCENE-2384 (it's part of 3x and trunk)

+1

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>                      LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
>                      LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
> as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe handling can stay
> with that EuropeanTokenizer, and it could be used by the European analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

