spark-user mailing list archives

From David Capwell <dcapw...@gmail.com>
Subject Re: Encoder with empty bytes deserializes with non-empty bytes
Date Thu, 22 Feb 2018 06:18:44 GMT
Ok found my issue

case c if c == classOf[ByteString] =>
  StaticInvoke(classOf[Protobufs], ArrayType(ByteType),
"fromByteString", parent :: Nil)

Should be

case c if c == classOf[ByteString] =>
  StaticInvoke(classOf[Protobufs], BinaryType, "fromByteString", parent :: Nil)



This causes the generated Java code to see a byte[], which uses a different
code path than the one linked. Since I had used ArrayType(ByteType), I had to
wrap the data in an ArrayData class.
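For context, a minimal sketch (not the author's actual code) of what a `Protobufs.fromByteString` helper returning a plain byte[] for BinaryType might look like. `MiniByteString` is a self-contained stand-in for `com.google.protobuf.ByteString`; its `toByteArray()` mirrors the real protobuf call:

```java
// Hypothetical sketch of the Protobufs.fromByteString helper referenced
// by the StaticInvoke above. With BinaryType, Spark expects a plain
// byte[], so the helper just copies the ByteString's contents out.
// MiniByteString stands in for com.google.protobuf.ByteString here.
public class Protobufs {
  static class MiniByteString {
    private final byte[] data;
    MiniByteString(byte[] data) { this.data = data; }
    byte[] toByteArray() { return data.clone(); }
  }

  public static byte[] fromByteString(MiniByteString bs) {
    return bs == null ? null : bs.toByteArray();
  }

  public static void main(String[] args) {
    byte[] out = fromByteString(new MiniByteString(new byte[0]));
    System.out.println(out.length);  // 0 — empty stays empty with BinaryType
  }
}
```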



On Wed, Feb 21, 2018 at 9:55 PM, David Capwell <dcapwell@gmail.com> wrote:

> I am trying to create an Encoder for protobuf data and noticed something
> rather weird.  When we have an empty ByteString (not null, just empty),
> deserializing gives back a non-empty array of length 8.  I took the generated
> code and saw something weird going on.
>
> UnsafeRowWriter
>
>
>    public void setOffsetAndSize(int ordinal, long currentCursor, long size) {
>        final long relativeOffset = currentCursor - startingOffset;
>        final long fieldOffset = getFieldOffset(ordinal);
>        final long offsetAndSize = (relativeOffset << 32) | size;
>
>        Platform.putLong(holder.buffer, fieldOffset, offsetAndSize);
>    }
>
>
>
> So this takes the size of the array and stores it... but it's not the array
> size, it's how many bytes were added:
>
> rowWriter2.setOffsetAndSize(2, tmpCursor16, holder.cursor - tmpCursor16);
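The packing done by setOffsetAndSize (and the matching unpacking in getBinary further down) can be sketched standalone: the high 32 bits hold the field's relative offset, the low 32 bits its size. A hypothetical demo, not Spark code:

```java
// Demonstration of how UnsafeRow packs a field's relative offset and
// byte size into one long: high 32 bits = offset, low 32 bits = size.
public class OffsetAndSizeDemo {
  static long pack(long relativeOffset, long size) {
    return (relativeOffset << 32) | size;
  }

  public static void main(String[] args) {
    long packed = pack(16, 8);                // offset 16, size 8
    int offset = (int) (packed >> 32);        // unpack as getBinary does
    int size = (int) packed;
    System.out.println(offset + "," + size);  // prints 16,8
  }
}
```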
>
>
>
> So since the data is empty the only method that moves the cursor forward is
>
> arrayWriter1.initialize(holder, numElements1, 8);
>
> which does the following
>
> holder.cursor += (headerInBytes + fixedPartInBytes);
>
> in a debugger I see that headerInBytes = 8 and fixedPartInBytes = 0.
>
> Here is the header write
>
>
>    Platform.putLong(holder.buffer, startingOffset, numElements);
>    for (int i = 8; i < headerInBytes; i += 8) {
>        Platform.putLong(holder.buffer, startingOffset + i, 0L);
>    }
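The headerInBytes = 8 seen in the debugger matches the UnsafeArrayData layout in Spark 2.x: 8 bytes for the element count, plus a word-aligned null bitmap that is zero words for an empty array. A sketch of that calculation (my reconstruction, not the actual Spark source):

```java
// Sketch of how the array header size is computed: 8 bytes for the
// element count plus a null bitset of one bit per element, rounded up
// to whole 8-byte words.
public class ArrayHeaderDemo {
  static long headerInBytes(int numElements) {
    long bitsetWidthInBytes = ((numElements + 63) / 64) * 8L;
    return 8 + bitsetWidthInBytes;
  }

  public static void main(String[] args) {
    // Empty array: no bitset words, header is just the 8-byte count.
    System.out.println(headerInBytes(0));  // prints 8
  }
}
```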
>
>
>
>
> OK, so far this makes sense: in order to deserialize you need to know
> about the data, so all good. Now to look at the deserialization path.
>
>
> UnsafeRow.java
>
> @Override
> public byte[] getBinary(int ordinal) {
>   if (isNullAt(ordinal)) {
>     return null;
>   } else {
>     final long offsetAndSize = getLong(ordinal);
>     final int offset = (int) (offsetAndSize >> 32);
>     final int size = (int) offsetAndSize;
>     final byte[] bytes = new byte[size];
>     Platform.copyMemory(
>       baseObject,
>       baseOffset + offset,
>       bytes,
>       Platform.BYTE_ARRAY_OFFSET,
>       size
>     );
>     return bytes;
>   }
> }
>
>
>
> Since this doesn't skip the header before returning the user bytes, it
> returns header + user-data.
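Put together, the round-trip for an empty array can be simulated without Spark; the names and buffer handling here are illustrative only:

```java
// Simulation of the reported round-trip: the writer records the total
// bytes written for the field (8-byte array header + element data), and
// getBinary copies exactly that many bytes back, header included. So an
// empty array deserializes to 8 bytes instead of 0.
import java.nio.ByteBuffer;

public class EmptyArrayRoundTrip {
  public static void main(String[] args) {
    byte[] userData = new byte[0];              // empty ByteString content
    int headerInBytes = 8;                      // numElements, stored as a long

    // Writer side: header followed by the (empty) element data
    ByteBuffer buf = ByteBuffer.allocate(headerInBytes + userData.length);
    buf.putLong(0L);                            // numElements = 0
    buf.put(userData);
    int size = buf.position();                  // what setOffsetAndSize records

    // Reader side (getBinary): copy 'size' bytes verbatim
    byte[] result = new byte[size];
    buf.rewind();
    buf.get(result);
    System.out.println(result.length);          // 8, not 0
  }
}
```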
>
>
>
> Is this expected? Am I supposed to filter out the header and force a
> mem-copy to filter out for just the user-data? Since header appears to be
> dynamic, how would I know the header length?
>
> Thanks for your time reading this email.
>
>
> Spark version: spark_2.11-2.2.1
>
