lucene-java-user mailing list archives

From Dawid Weiss
Subject Re: BytesRef violates the principle of least astonishment
Date Wed, 20 May 2015 06:49:42 GMT
Yes, BytesRef can be surprising. No, it probably won't change in
Lucene to comply with superb design principles. Yes, the odd design is
there for performance reasons and it does provide noticeable gain.

Perhaps you could file a JIRA issue to improve the documentation; this
would be helpful. For what it's worth, clone() is covariant in
BytesRef and its javadoc explicitly says:

   * Returns a shallow clone of this instance (the underlying bytes are
   * <b>not</b> copied and will be shared by both the returned object and this
   * object.
   * @see #deepCopyOf
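
To make the shallow-versus-deep distinction concrete, here is a minimal, self-contained sketch. It does not use Lucene: Slice is a hypothetical stand-in for a mutable byte slice, and advance() is an assumed simulation of an enumerator that reuses one buffer for every term. It only illustrates why a shallow clone can end up reading "forractor".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class SliceCopyDemo {

    /** Hypothetical stand-in for a mutable byte slice; NOT Lucene's BytesRef. */
    static final class Slice {
        byte[] bytes;
        int offset;
        int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** Shallow copy: own offset/length fields, but the SAME backing array. */
        @Override
        public Slice clone() {
            return new Slice(bytes, offset, length);
        }

        /** Deep copy: a private array holding only this slice's bytes. */
        static Slice deepCopyOf(Slice other) {
            byte[] copy = Arrays.copyOfRange(
                    other.bytes, other.offset, other.offset + other.length);
            return new Slice(copy, 0, other.length);
        }

        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Assumed behaviour: overwrite the shared buffer in place with the next term. */
    static void advance(Slice shared, String term) {
        byte[] src = term.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, shared.bytes, 0, src.length);
        shared.offset = 0;
        shared.length = src.length;
    }

    /** Returns what an alias, a shallow clone and a deep copy see after the buffer moves on. */
    static List<String> run() {
        Slice shared = new Slice(new byte[16], 0, 0);
        advance(shared, "extractor");

        Slice alias = shared;                   // same object, no copy at all
        Slice shallow = shared.clone();         // keeps length 9, shares the bytes
        Slice deep = Slice.deepCopyOf(shared);  // owns its bytes

        advance(shared, "for");                 // the "enumerator" reuses the buffer

        return Arrays.asList(
                alias.utf8ToString(), shallow.utf8ToString(), deep.utf8ToString());
    }

    public static void main(String[] args) {
        System.out.println(run()); // [for, forractor, extractor]
    }
}
```

In Lucene itself the corresponding call is the one the javadoc points to, BytesRef.deepCopyOf(text), used where the quoted code below calls text.clone().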


On Wed, May 20, 2015 at 6:19 AM, Trejkaz wrote:
> Hi all.
> The Lucene 4 migration guide "helpfully" suggests working with
> BytesRef directly rather than converting to string, but I disagree.
> Take the following example of building up a List<Term> by iterating a
> TermsEnum. I think it is written in a fairly straight-forward fashion.
> I added some println which aren't really there, to illustrate the
> place I have my breakpoints.
>     protected List<Term> toList(String field, TermsEnum termsEnum) throws IOException {
>         List<Term> terms = new LinkedList<>();
>         BytesRef text;
>         //noinspection NestedAssignment
>         while ((text = termsEnum.next()) != null) {
>             Term term = new Term(field, text);
>             System.out.println("in loop: " + term);
>             terms.add(term);
>         }
>         System.out.println("at end: " + terms);
>         return terms;
>     }
> When you actually try to call this, weird shit happens.
>     in loop: content:term
>     at end: [content:testing]
>     in loop: content:extractor
>     at end: [content:for]
> Basically, by the time you exit the while loop, the BytesRef you put
> into the Term has changed to point to the next term in the index. So
> okay, so BytesRef is mutable. I hate mutable stuff, but luckily we
> have clone() on this class, so I'll just clone it when creating the
> term:
>             Term term = new Term(field, text.clone());
> Now the output is:
>     in loop: content:term
>     at end: [content:test]
>     in loop: content:extractor
>     at end: [content:forractor]
> WTF?
> Now it seems like it clones the length of the slice but not the actual
> data, and the actual data has still changed underneath it. Great. So
> basically, the only safe way to use BytesRef is to treat it like a hot
> potato and immediately call utf8ToString() to get hold of an object
> you can trust.
>             Term term = new Term(field, text.utf8ToString());
> And then finally you get:
>     in loop: content:term
>     at end: [content:term]
>     in loop: content:extractor
>     at end: [content:extractor]
> I will probably eventually formalise this in our code by making
> utility wrappers which don't expose BytesRef to the caller, since it's
> so easy to do the wrong thing with it.
> They say a good measure of the quality of a library is the number of
> times you say "WTF" while trying to figure out how to use it. I have
> already lost count.
> TX
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
