fluo-notifications mailing list archives

From ctubbsii <...@git.apache.org>
Subject [GitHub] incubator-fluo-website pull request #41: Blog post about immutable bytes
Date Wed, 09 Nov 2016 21:28:56 GMT
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/incubator-fluo-website/pull/41#discussion_r87280336
  
    --- Diff: _posts/blog/2016-11-09-immutable-bytes.md ---
    @@ -0,0 +1,180 @@
    +---
    +title: "Java needs an immutable byte string"
    +date: 2016-11-09 11:43:00 +0000
    +author: Keith Turner
    +---
    +
    +## Byte Sequences in Java
    +
    +Working with byte arrays in Java can be painful.  To work around this, Fluo created [Bytes] and
    +[BytesBuilder].  Bytes is an immutable wrapper around a byte array.  Bytes has good implementations
    +of `hashCode()`, `equals()`, and `compareTo()`.  These methods and its immutability make it
    +suitable to use as a map key.  If you have ever tried to use a byte array as a map key in Java, you
    +quickly realize that you need to write a wrapper class.
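
The map key problem described above can be reproduced with a short sketch. The `Key` class here is a hypothetical minimal wrapper written only for illustration; it is not Fluo's `Bytes`, and the class and method names are assumptions. A `byte[]` inherits identity-based `equals()` and `hashCode()` from `Object`, so a second array with the same contents does not find the entry, while a wrapper that compares contents does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteArrayKeyDemo {

  // Hypothetical minimal wrapper, for illustration only (not Fluo's Bytes).
  static final class Key {
    private final byte[] data;

    Key(byte[] data) {
      // Defensive copy so later changes to the caller's array cannot leak in.
      this.data = Arrays.copyOf(data, data.length);
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Key && Arrays.equals(data, ((Key) o).data);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(data);
    }
  }

  // Looks up an equal-content array in a map keyed by byte[].
  static String lookupWithArrayKey() {
    Map<byte[], String> map = new HashMap<>();
    map.put(new byte[] {1, 2, 3}, "value");
    return map.get(new byte[] {1, 2, 3});
  }

  // Same lookup, but keyed by the content-comparing wrapper.
  static String lookupWithWrapperKey() {
    Map<Key, String> map = new HashMap<>();
    map.put(new Key(new byte[] {1, 2, 3}), "value");
    return map.get(new Key(new byte[] {1, 2, 3}));
  }

  public static void main(String[] args) {
    System.out.println(lookupWithArrayKey());   // prints null
    System.out.println(lookupWithWrapperKey()); // prints value
  }
}
```

A production-quality wrapper would also need `compareTo()` and accessors, which this sketch leaves out.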
    +
    +Fluo should not have to create these classes.  Java really should offer something out of the box as
    +part of its standard library.  However, it does not offer any good class for this common case.
    +
    +## Why not use String?
    +
    +Trying to stuff arbitrary binary data into a String can corrupt the data.  The following little
    +program shows this; on a JVM whose default charset is UTF-8, it prints `false`.
    +
    +```java
    +    byte bytes1[] = new byte[256];
    +    for(int i = 0; i<bytes1.length; i++)
    +      bytes1[i] = (byte)i;
    +
    +    byte bytes2[] = new String(bytes1).getBytes();
    +    
    +    System.out.println(Arrays.equals(bytes1, bytes2));
    +```
    +
    +String can be made to work by specifying a character set that maps every byte value to a character.
    +The following program will print `true`.  However, this is error prone and inefficient.  Using this
    +method results in copying between byte arrays and internal string char arrays.
    +
    +```java
    +    byte bytes1[] = new byte[256];
    +    for(int i = 0; i<bytes1.length; i++)
    +      bytes1[i] = (byte)i;
    +
    +    String str = new String(bytes1, StandardCharsets.ISO_8859_1);
    +    byte bytes2[] = str.getBytes(StandardCharsets.ISO_8859_1);
    +    
    +    System.out.println(Arrays.equals(bytes1, bytes2));
    +```
    +
    +## Why not use ByteBuffer?
    +  
    +A read-only ByteBuffer might seem like it would fit the bill of an immutable byte array wrapper.
    +However, the following program shows two ways that ByteBuffer falls short.  ByteBuffers are great
    +for I/O, but it would not be prudent to use them as map keys.
    +
    +```java
    +    byte[] bytes1 = new byte[] {1,2,3,(byte)250};
    +    ByteBuffer bb1 = ByteBuffer.wrap(bytes1).asReadOnlyBuffer();
    +
    +    System.out.println(bb1.hashCode());
    +    bytes1[2]=89;
    +    System.out.println(bb1.hashCode());
    +    bb1.get();
    +    System.out.println(bb1.hashCode());
    +```
    +
    +The program above prints the following, which is less than ideal when using a
    +ByteBuffer as a HashMap key:
    +
    +```
    +747721
    +830367
    +26786
    +```
    +
    +This little program shows two things.  First, the only guarantee we are getting from
    +`asReadOnlyBuffer()` is that `bb1` cannot be used to modify `bytes1`.  However, the originator of
    +the read-only buffer can still modify the wrapped byte array.  Java's String and Fluo's Bytes avoid
    +this by always copying data into an internal private array that never escapes.
    +
    +The second issue is that `bb1` has a position and calling `bb1.get()` changes this position.
    +Changing the position conceptually changes the contents of the ByteBuffer.  This is why `hashCode()`
    +returns something different after `bb1.get()` is called.  So even though `bb1` does not enable
    +mutating `bytes1`, `bb1` is itself mutable.  One might think that calling
    +`map.put(bb1.duplicate(), aValue)` would avoid issues.  However, any code iterating over a map's
    +keys could mutate a ByteBuffer's position.
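
As a rough illustration of why a changing position matters for map keys, the sketch below stores a read-only ByteBuffer in a `HashMap` and probes for it with a fresh buffer holding the same bytes, once before and once after `get()` is called on the stored key. The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

class ByteBufferKeyDemo {

  // Returns {found before key.get(), found after key.get()}.
  static boolean[] lookupBeforeAndAfterGet() {
    ByteBuffer key = ByteBuffer.wrap(new byte[] {1, 2, 3, 4}).asReadOnlyBuffer();
    Map<ByteBuffer, String> map = new HashMap<>();
    map.put(key, "value");

    // A fresh buffer with the same remaining bytes equals the stored key.
    boolean before = map.containsKey(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));

    // Reading one byte advances the stored key's position, so it now
    // represents only the remaining bytes {2, 3, 4}.
    key.get();

    boolean after = map.containsKey(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));
    return new boolean[] {before, after};
  }

  public static void main(String[] args) {
    boolean[] found = lookupBeforeAndAfterGet();
    System.out.println(found[0]); // prints true
    System.out.println(found[1]); // prints false
  }
}
```

The entry is still in the map after `get()`, but it can no longer be found with the original contents.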
    +
    +## Why not use Protobuf's ByteString?
    +
    +[Protocol Buffers][pb] has a beautiful implementation of an immutable byte array wrapper called
    +[ByteString].  I would encourage its use when possible.  However, in Fluo's case it's not really
    +appropriate to use for two reasons.  First, any library designer should try to minimize the
    +transitive dependencies they force on users.  Internally, Fluo does not currently use Protocol
    +Buffers in its implementation, so this would be a new dependency for Fluo users.  The second reason
    +is going to require some background to explain.
    +
    +Technologies like [OSGI] and [Jigsaw] seek to modularize Java libraries and provide dependency
    +isolation.  Dependency isolation allows a user to use a library without having to share the
    +library's dependencies.  For example, consider the following hypothetical scenario.
    +
    + * Fluo's implementation uses Protobuf version 2.5
    + * Fluo user code uses Protobuf version 1.8
    +
    +Without dependency isolation, the user must converge dependencies and make their application and
    +Fluo's implementation use the same version of Protobuf.  Sometimes this works without issue, but
    +sometimes things will break because Protobuf dropped, changed, or added a method.
    +
    +With dependency isolation, Fluo's implementation and Fluo user code can easily use different
    +versions of Protobuf.  This is only true as long as Fluo's API does not use Protobuf.  So this is
    +the second reason that Fluo should not use classes from Protobuf in its API.  If Fluo used Protobuf
    +in its API, then it would force the user to converge dependencies, even if they are using OSGI or
    +Jigsaw.
    +
    +## What about the copies?
    +
    +As mentioned earlier, an immutable type requires a defensive copy at creation time.  When we were
    +designing Fluo's API, we were worried about this at first.  However, a simple truth became apparent.
    +If the API took a mutable type, then all boundary points between the user and Fluo would require
    +defensive copies.  For example, assume Fluo's API took byte arrays and consider the following code.
    +
    +```java
    +//A Fluo transaction
    +Transaction tx = ...
    +byte[] row = ...
    +
    +tx.set(row, col1, v1);
    +tx.set(row, col2, v2);
    +tx.set(row, col3, v3);
    +```
    +
    +Fluo will buffer changes until a transaction is committed.  In the example above, since Fluo accepts
    +a mutable row, it would be prudent to do a defensive copy each time `set` is called.
    +
    +In the code below, where an immutable byte array wrapper is used, the calls to `set` do not need to
    +make a defensive copy.  So when comparing the two examples, the immutable byte wrapper results in
    +fewer defensive copies.
    +
    +```java
    +//A Fluo transaction
    +Transaction tx = ...
    +Bytes row = ...
    +
    +tx.set(row, col1, v1);
    +tx.set(row, col2, v2);
    +tx.set(row, col3, v3);
    +```
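
To make the copy counting concrete, here is a minimal sketch under stated assumptions: `ImmutableBytes` and `PendingWrites` are hypothetical stand-ins invented for this example and are not Fluo's `Bytes` or `Transaction`. The wrapper copies once at creation, the buffering code stores the same instance on every `set` call without copying, and a later change to the caller's array does not reach the buffered data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DefensiveCopyDemo {

  // Hypothetical immutable byte wrapper (illustration only, not Fluo's Bytes).
  static final class ImmutableBytes {
    private final byte[] data;

    private ImmutableBytes(byte[] data) {
      this.data = data;
    }

    // The only defensive copy happens here, at creation time.
    static ImmutableBytes of(byte[] source) {
      return new ImmutableBytes(Arrays.copyOf(source, source.length));
    }

    byte byteAt(int index) {
      return data[index];
    }
  }

  // Hypothetical stand-in for a transaction that buffers writes until commit.
  static final class PendingWrites {
    private final List<ImmutableBytes> rows = new ArrayList<>();

    // No copy needed: the argument cannot change after it is stored.
    void set(ImmutableBytes row) {
      rows.add(row);
    }

    ImmutableBytes rowAt(int index) {
      return rows.get(index);
    }
  }

  // Buffers the same row three times, then mutates the caller's array.
  static PendingWrites bufferThenMutateSource() {
    byte[] raw = {1, 2, 3};
    ImmutableBytes row = ImmutableBytes.of(raw);

    PendingWrites tx = new PendingWrites();
    tx.set(row);
    tx.set(row);
    tx.set(row);

    raw[0] = 99; // does not affect the buffered rows
    return tx;
  }

  public static void main(String[] args) {
    PendingWrites tx = bufferThenMutateSource();
    System.out.println(tx.rowAt(0).byteAt(0));       // prints 1
    System.out.println(tx.rowAt(0) == tx.rowAt(2));  // prints true
  }
}
```

With a `byte[]` parameter instead, each of the three `set` calls would need its own copy to get the same protection.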
    +
    +## Improving the situation
    +
    +So far, the following arguments have been presented.
    --- End diff --
    
    End this sentence with a colon (`:`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
