lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef
Date Wed, 22 Jun 2011 15:50:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053319#comment-13053319 ]

Robert Muir commented on LUCENE-3080:
-------------------------------------

Well, personally I am hesitant to introduce any encodings or bytes into our current analysis
chain, because it's unnecessary complexity that will introduce bugs (at the moment, it's the
user's responsibility to create the appropriate Reader, etc.).

Furthermore, not all character sets can be 'corrected' with a linear conversion like this:
some, for example, actually order the text in a different direction. There are many quirks
to non-Unicode character sets.

Maybe as a start, it would be useful to prototype some simple experiments with a "binary analysis
chain" and hack up a highlighter to work with them? This way we would have an understanding
of what the potential performance gain is.

Here's some example code for a dead simple binary analysis chain that only uses bytes the
whole way through. You could take these ideas and prototype one with just all-ASCII terms,
splitting on the space byte and such:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestBinaryTerms.java
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/BinaryTokenStream.java
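To make the suggested experiment concrete, here is a minimal sketch of the idea in plain Java,
assuming pure ASCII input. It is not taken from the linked test classes and uses no Lucene API:
the class SpaceByteTokenizer, its Slice type, and the <b> markup are all hypothetical names
chosen for illustration. Terms are kept as (offset, length) slices of the original byte[], and
the only decode to a String happens once, at the very end.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits an ASCII byte buffer on the
// space byte and records each term as an (offset, length) slice of the input.
final class SpaceByteTokenizer {

    /** A term expressed as a slice of the original buffer; no bytes are copied. */
    static final class Slice {
        final int offset;
        final int length;

        Slice(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Returns one slice per maximal run of non-space bytes. */
    static List<Slice> tokenize(byte[] text) {
        List<Slice> terms = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a term"
        for (int i = 0; i <= text.length; i++) {
            boolean atBoundary = (i == text.length) || text[i] == (byte) ' ';
            if (atBoundary) {
                if (start >= 0) {
                    terms.add(new Slice(start, i - start));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        return terms;
    }

    /** True when the slice holds exactly the bytes of the query term. */
    static boolean matches(byte[] text, Slice s, byte[] query) {
        if (s.length != query.length) {
            return false;
        }
        for (int i = 0; i < query.length; i++) {
            if (text[s.offset + i] != query[i]) {
                return false;
            }
        }
        return true;
    }

    /** Wraps each matching term in <b>...</b>, staying in bytes until the final decode. */
    static String highlight(byte[] text, byte[] query) {
        byte[] open = "<b>".getBytes(StandardCharsets.US_ASCII);
        byte[] close = "</b>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int copied = 0; // everything before this index is already in `out`
        for (Slice s : tokenize(text)) {
            if (matches(text, s, query)) {
                out.write(text, copied, s.offset - copied);
                out.write(open, 0, open.length);
                out.write(text, s.offset, s.length);
                out.write(close, 0, close.length);
                copied = s.offset + s.length;
            }
        }
        out.write(text, copied, text.length - copied);
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] text = "the quick brown fox".getBytes(StandardCharsets.US_ASCII);
        byte[] query = "quick".getBytes(StandardCharsets.US_ASCII);
        System.out.println(highlight(text, query)); // prints: the <b>quick</b> brown fox
    }
}
```

Timing something like this against an equivalent char[]-based version on the same input would
give a first, rough idea of the potential gain, with the caveat that it sidesteps every encoding
problem mentioned above by assuming ASCII.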



> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not
> as BytesRef), which is causing problems for merging SOLR-2497 to trunk.


        
