     [ https://issues.apache.org/jira/browse/LUCY-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Wellnhofer reassigned LUCY-196:
------------------------------------

    Assignee: Nick Wellnhofer

> UAX #29 tokenizer
> -----------------
>
>                 Key: LUCY-196
>                 URL: https://issues.apache.org/jira/browse/LUCY-196
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Nick Wellnhofer
>            Assignee: Nick Wellnhofer
>            Priority: Minor
>             Fix For: 0.3.0 (incubating)
>
>
> It would be nice to have a default tokenizer in core. A tokenizer based on the Unicode word boundaries defined in UAX #29 Unicode Text Segmentation seems like a good choice. That's also how Lucene's StandardTokenizer works.
> See the following thread on lucy-dev
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/browser
> Also see
> http://unicode.org/reports/tr29/#Word_Boundaries

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira