kafka-users mailing list archives

From Steve Graham <sggraha...@gmail.com>
Subject Re: Trying to understand the format of the LogSegment file.
Date Fri, 04 Dec 2015 13:43:59 GMT
Very helpful, thanks Magnus.

Should the documentation found in http://kafka.apache.org/documentation.html#messages be updated
to reflect this format for messages?

So, to close this one out, this is the section in the protocol guide 
(https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets)
I used:
Variable Length Primitives
bytes, string - These types consist of a signed integer giving a length N followed by N bytes
of content. A length of -1 indicates null. string uses an int16 for its size, and bytes uses
an int32.

…

MessageSet => [Offset MessageSize Message]
  Offset => int64
  MessageSize => int32

Message => Crc MagicByte Attributes Key Value
  Crc => int32
  MagicByte => int8
  Attributes => int8
  Key => bytes
  Value => bytes

And therefore I can interpret the bytes from the hex dump as follows:

Bytes 0-7 (00 00 00 00 00 00 00 00) MessageSet offset
Bytes 8-11 (00 00 00 16) MessageSize: crc (4) + magic byte (1) + attributes (1) + key (4+0)
+ value (4+8) = 22 decimal, 16 hex
Bytes 12-15 (e9 75 38 bc) CRC
Byte  16 (00) magic byte
Byte  17 (00) attributes
Bytes 18-21 (ff ff ff ff) length field of the key, with -1 meaning key is null, no key bytes
follow
Bytes 22-25 (00 00  00 08) length field of the value
Bytes 26-33 (6d 65 73 73 61 67 65 31) the value bytes
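
For anyone who wants to check this mapping programmatically, here is a minimal Python sketch that
walks the same 34 bytes using the MessageSet/Message grammar quoted above. The names RAW, read_bytes
and parse_entry are made up for illustration and are not part of Kafka. The CRC check assumes the
checksum is a standard CRC-32 over everything after the crc field (magic byte through value); that
is an assumption about this message format, not something confirmed in this thread.

```python
import struct
import zlib

# The 34 bytes from the hex dump, split by field for readability.
RAW = bytes.fromhex(
    "0000000000000000"  # Offset (int64)
    "00000016"          # MessageSize (int32) = 22
    "e97538bc"          # Crc (int32)
    "00"                # MagicByte (int8)
    "00"                # Attributes (int8)
    "ffffffff"          # Key length (int32), -1 means null key
    "00000008"          # Value length (int32)
    "6d65737361676531"  # Value bytes ("message1")
)


def read_bytes(buf, pos):
    """Read a protocol 'bytes' field: int32 length N, then N bytes; -1 is null."""
    (n,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if n == -1:
        return None, pos
    return buf[pos:pos + n], pos + n


def parse_entry(buf, pos=0):
    """Parse one [Offset MessageSize Message] entry starting at pos."""
    # All integers are big-endian (network byte order).
    offset, size = struct.unpack_from(">qi", buf, pos)
    body = pos + 12  # first byte of the Message (the crc field)
    crc, magic, attributes = struct.unpack_from(">Ibb", buf, body)
    key, p = read_bytes(buf, body + 6)
    value, p = read_bytes(buf, p)
    # Assumption: the CRC covers the bytes from the magic byte to the end of the message.
    crc_ok = zlib.crc32(buf[body + 4:body + size]) == crc
    return {
        "offset": offset, "size": size, "crc": crc, "magic": magic,
        "attributes": attributes, "key": key, "value": value,
        "crc_ok": crc_ok, "end": p,
    }


if __name__ == "__main__":
    for name, val in parse_entry(RAW).items():
        print(name, val)
```

Running parse_entry repeatedly from the returned "end" position should step through a log segment
that holds more than one uncompressed message, under the same assumptions.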

> On Dec 3, 2015, at 4:14 PM, Magnus Edenhill <magnus@edenhill.se> wrote:
> 
> Hi,
> 
> messages are stored on disk in the Kafka (network) protocol format, so if
> you have a look at the protocol guide you'll see the pieces start coming
> together:
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets
> 
> Regards,
> Magnus
> 
> 
> 
> 2015-12-03 18:18 GMT+01:00 Steve Graham <sggraham64@gmail.com>:
> 
>> I am attempting to understand the details of the content of the log
>> segment file in Kafka.
>> 
>> The documentation (http://kafka.apache.org/081/documentation.html#log)
>> suggests:
>> The exact binary format for messages is versioned and maintained as a
>> standard interface so message sets can be transferred between producer,
>> broker, and client without recopying or conversion when desirable. This
>> format is as follows:
>> 
>> On-disk format of a message
>> 
>> message length : 4 bytes (value: 1+4+n)
>> "magic" value  : 1 byte
>> crc            : 4 bytes
>> payload        : n bytes
>> 
>> 
>> But I am struggling to map the documentation to what I see on the disk.
>> 
>> I created a topic, named sample-topic, and added one message to it (via
>> the console producer).  The message payload was “message1”.
>> 
>> The DumpLogSegments tool shows:
>> Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
>> Starting offset: 0
>> offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec crc: 3916773564
>> 
>> Taking a hex dump of the (only) log file:
>> sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
>> 00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
>> 00000020  65 31                                             |e1|
>> 00000022
>> 
>> I tried to “reverse engineer” the contents, to see how it corresponds to
>> the documentation:
>> 
>> Bytes 0-7 (00 00 00 00 00 00 00 00).  I am not sure what this is, some
>> sort of filler?
>> Bytes 8-11 (00 00 00 16) seems to be some length field?  Decimal 22, which
>> seems to correspond to the length of the entire message, but that is more
>> than the 1+4+n suggested by the documentation.
>> Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal
>> 3916773564).  No problem here.
>> Bytes 16-17 (00 00) not sure what this is.
>> Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”?  But
>> that should be just one byte.  Must be something else?
>> Bytes 22-25 (00 00  00 08) is the message payload size (8), this is the
>> value of “n” in the formula for message length, exactly the length of the
>> “message1” string. No problem here.
>> Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).
>> No problem here.
>> 
>> Can anyone on the list help me reconcile the documentation to what I see
>> on the disk?  Specifically:
>> a) what are the first 8 bytes supposed to represent?
>> b) the message length field as described as 1+4+n doesn’t correspond with
>> what I see on disk.  It looks like 4 (crc) + 2 (??) + 4 (?magic number?) +
>> 4 (payload length) + 8 (n).  What is the correct formula?
>> c) why does the CRC appear so early in the message (bytes 12-15)?  Shouldn’t
>> the magic value appear before the CRC?
>> d) what is the way to interpret bytes 16-21?  is the magic number in here
>> somewhere?  What else is in this set of bytes?
>> 
>> Thanks
>> sgg
>> 
>> 

