lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6993) Update TLDs to latest list
Date Thu, 18 Feb 2016 00:01:18 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151457#comment-15151457 ]

Robert Muir commented on LUCENE-6993:
-------------------------------------

Basically, the old versions of the Tokenizer and its Impl are just "saved" to a subdirectory,
and the Analyzer and TokenizerFactory conditionally use them if you request that compatibility
version.

Have a look at branch_5x, which still has {{std40}} containing StandardTokenizer40, StandardTokenizerImpl40,
UAX29URLEmailTokenizer40, and so on. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also
have a testBackcompat40 which calls {{setVersion}} and ensures it works. Finally, see StandardAnalyzer.java
and StandardTokenizerFactory.java, and UAX29URLEmailAnalyzer.java and UAX29URLEmailTokenizerFactory.java,
which conditionally use the {{std40}} tokenizers depending on version.

So we should do a similar thing with the current stuff in master before modifying the files,
and make them {{std55}}. We can just test that it works at all (e.g. foo bar -> foo,bar)
initially and later maybe add a test ensuring "old behavior" stays the same.
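
To make the "works at all" check concrete, here is a minimal sketch in plain Java. It does not
use Lucene's test framework or the real {{std55}} classes: {{BackcompatSmokeCheck}}, {{legacyTokenize}}
and {{assertTokens}} are hypothetical names, and the whitespace split is only a stand-in for
the frozen tokenizer. The real test would presumably follow the testBackcompat40 pattern mentioned above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BackcompatSmokeCheck {

    /**
     * Hypothetical stand-in for the frozen (std55) tokenizer.
     * Assumption: splitting on whitespace is enough for a smoke test input.
     */
    static List<String> legacyTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Fails loudly if the tokens differ from what we expect. */
    static void assertTokens(String input, List<String> expected) {
        List<String> actual = legacyTokenize(input);
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        // The sanity check from the comment: foo bar -> foo,bar
        assertTokens("foo bar", Arrays.asList("foo", "bar"));
        System.out.println("ok");
    }
}
```

A stricter "old behavior stays the same" test could later reuse the same helper with inputs
whose tokenization is expected to differ between the old and new grammars.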

Then you can bump the Unicode version and the TLD lists, and it won't change any behavior if someone
asks for version < 6.0, because they will get the exact same tokenizer as before.
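
The selection logic described above can be sketched as follows. This is an illustration in
plain Java, not the actual Lucene code: {{SimpleTokenizer}}, {{Tokenizer55}}, {{CurrentTokenizer}}
and {{forVersion}} are hypothetical names, and both stand-ins split on whitespace, so only
the choice of class differs. The only detail taken from the comment is that requests for a version
below 6.0 get the frozen copy.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionedTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer; this is not Lucene's Tokenizer API. */
    interface SimpleTokenizer {
        String name();
        List<String> tokenize(String text);
    }

    /** Splits on runs of whitespace, dropping empty pieces. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Frozen copy of the pre-6.0 behavior; never edited after it is saved. */
    static final class Tokenizer55 implements SimpleTokenizer {
        @Override public String name() { return "std55"; }
        @Override public List<String> tokenize(String text) { return splitOnWhitespace(text); }
    }

    /** Current tokenizer; this is the one that may pick up new Unicode/TLD data. */
    static final class CurrentTokenizer implements SimpleTokenizer {
        @Override public String name() { return "current"; }
        @Override public List<String> tokenize(String text) { return splitOnWhitespace(text); }
    }

    /** Picks the frozen copy for any requested major version below 6. */
    static SimpleTokenizer forVersion(int requestedMajor) {
        return requestedMajor >= 6 ? new CurrentTokenizer() : new Tokenizer55();
    }

    public static void main(String[] args) {
        SimpleTokenizer old = forVersion(5);
        SimpleTokenizer current = forVersion(6);
        System.out.println(old.name() + " " + old.tokenize("foo bar"));         // std55 [foo, bar]
        System.out.println(current.name() + " " + current.tokenize("foo bar")); // current [foo, bar]
    }
}
```

Because the frozen class is a separate copy rather than a flag inside the current one, later
changes to the current tokenizer cannot leak into what older versions see.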

> Update TLDs to latest list
> --------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the list of TLDs
> again. Comparing our old list with a new list indicates 800+ new domains, so it would be nice
> to include them.




