lucene-dev mailing list archives

From "Chuck Williams (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Thu, 04 Jan 2007 03:16:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122 ]

Chuck Williams commented on LUCENE-510:
---------------------------------------

Has an improvement been made to eliminate the reported 20% indexing hit?  That would be a
big price to pay.

To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc()
with a FieldSelector) are much more important than standard UTF-8 compliance.
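
To illustrate why the byte length helps when scanning past unwanted fields, here is a rough sketch. It is not the Lucene code: the class and method names are hypothetical, a single length byte stands in for the real VInt prefix, and it assumes short strings with no NUL or surrogate characters. With a byte count the reader can jump over the string; with a char count it has to look at the lead byte of every character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene's IndexInput/IndexOutput.
final class StringSkipSketch {
    // Prefix = number of UTF-8 bytes that follow.
    static void putByteCounted(ByteBuffer out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.put((byte) utf8.length);
        out.put(utf8);
    }

    // Prefix = number of Java chars (assumes no surrogates, so one
    // encoded sequence per char).
    static void putCharCounted(ByteBuffer out, String s) {
        out.put((byte) s.length());
        out.put(s.getBytes(StandardCharsets.UTF_8));
    }

    // Skipping is a single position bump.
    static void skipByteCounted(ByteBuffer in) {
        int byteLen = in.get() & 0xFF;
        in.position(in.position() + byteLen);
    }

    // Skipping must inspect each lead byte to learn the char's width.
    static void skipCharCounted(ByteBuffer in) {
        int charLen = in.get() & 0xFF;
        for (int i = 0; i < charLen; i++) {
            int lead = in.get() & 0xFF;
            if ((lead & 0x80) == 0) continue;      // 1-byte char
            in.get();                               // 2nd byte
            if ((lead & 0xF0) == 0xE0) in.get();    // 3rd byte
        }
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocate(32);
        putByteCounted(a, "caf\u00e9");
        a.flip();
        skipByteCounted(a);
        System.out.println("byte-counted skip ends at " + a.position());

        ByteBuffer b = ByteBuffer.allocate(32);
        putCharCounted(b, "caf\u00e9");
        b.flip();
        skipCharCounted(b);
        System.out.println("char-counted skip ends at " + b.position());
    }
}
```

Both skips land on the same offset; the difference is only in how much work it takes to get there.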

A 20% hit seems surprising.  The pre-scan over the string to be written shouldn't cost much
compared to the cost of tokenizing and indexing that string (assuming it is in an indexed
field).
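
For reference, the kind of pre-scan meant here could look roughly like the following. This is a sketch under assumptions, not the code in the attached patch: the class and method names are made up, and it counts standard UTF-8 bytes for well-formed input (an unpaired surrogate is counted as 3 bytes, which differs from what String.getBytes would produce).

```java
// Hypothetical helper; not taken from strings.diff.
final class Utf8LengthSketch {
    /** Counts the bytes standard UTF-8 needs for s, without encoding it. */
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                 // ASCII
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                 // surrogate pair = one 4-byte sequence
                i++;                        // consume the low surrogate too
            } else {
                bytes += 3;                 // rest of the BMP (and unpaired surrogates)
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("caf\u00e9"));
    }
}
```

The loop is a single pass with a couple of comparisons per char, which is why it should be cheap next to analysis of the same text.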

In case it is relevant, I had a related issue in my bulk updater: a vint that the Lucene
index format requires at the beginning of a record was not known until the end of that
record had been processed.  I solved this with a fixed-length vint that was estimated up
front and revised if necessary after the whole record was processed.  The vint
representation still works if more bytes than necessary are written.


> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>         Assigned To: Grant Ingersoll
>             Fix For: 2.1
>
>         Attachments: SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string
> is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change.  At least the format
> number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is
> released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of
> deprecated features).
