kafka-users mailing list archives

From Steve Graham <sggraha...@gmail.com>
Subject Trying to understand the format of the LogSegment file.
Date Thu, 03 Dec 2015 17:18:45 GMT
I am attempting to understand the details of the content of the log segment file in Kafka.

The documentation (http://kafka.apache.org/081/documentation.html#log)  suggests:
The exact binary format for messages is versioned and maintained as a standard interface so
message sets can be transferred between producer, broker, and client without recopying or conversion
when desirable. This format is as follows:

On-disk format of a message

message length : 4 bytes (value: 1+4+n) 
"magic" value  : 1 byte
crc            : 4 bytes
payload        : n bytes
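
For comparison with what is on disk, here is a minimal sketch of what the documented layout alone would produce for an 8-byte payload. The function name `encode_documented` is hypothetical, and the sketch assumes big-endian integers and a CRC-32 computed over the payload only, since the quoted documentation states neither.

```python
import struct
import zlib


def encode_documented(payload: bytes, magic: int = 0) -> bytes:
    """Encode one message using only the four documented fields."""
    # Assumption: the CRC covers just the payload; the quoted docs do not say.
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    # magic (1 byte) + crc (4 bytes) + payload (n bytes), no padding with ">"
    body = struct.pack(">BI", magic, crc) + payload
    # 4-byte length prefix whose value is 1 + 4 + n
    return struct.pack(">I", len(body)) + body


encoded = encode_documented(b"message1")
length_field = struct.unpack(">I", encoded[:4])[0]
print(length_field, len(encoded))  # prints "13 17"
```

Under those assumptions the length field would hold 13 and the whole record would occupy 17 bytes.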


But I am struggling to map the documentation to what I see on the disk.

I created a topic named sample-topic and added one message to it (via the console producer).
The message payload was “message1”.

The DumpLogSegments tool shows:
Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec
crc: 3916773564

Taking a hex dump of the (only) log file:
sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
00000020  65 31                                             |e1|
00000022
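
To work with these bytes programmatically rather than by eye, the `hexdump -C` text can be turned back into a byte string. This is a small sketch; `parse_hexdump` is a hypothetical helper, and it does not handle the `*` lines that hexdump prints when it squeezes repeated rows.

```python
HEXDUMP = """\
00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
00000020  65 31                                             |e1|
00000022
"""


def parse_hexdump(text: str) -> bytes:
    """Rebuild raw bytes from `hexdump -C` output (no support for '*' rows)."""
    out = bytearray()
    for line in text.splitlines():
        hex_part = line.split("|", 1)[0]   # drop the ASCII column
        tokens = hex_part.split()[1:]      # drop the leading offset column
        out.extend(int(tok, 16) for tok in tokens)
    return bytes(out)


data = parse_hexdump(HEXDUMP)
print(len(data), data[-8:])  # prints "34 b'message1'"
```

The file is 34 bytes long (0x22, matching the final offset line), and the last 8 bytes are the payload.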

I tried to “reverse engineer” the contents, to see how it corresponds to the documentation:

Bytes 0-7 (00 00 00 00 00 00 00 00).  I am not sure what this is, some sort of filler?
Bytes 8-11 (00 00 00 16) seem to be a length field: decimal 22, which corresponds to the
length of the rest of the message (bytes 12-33), but that is more than the 1+4+n suggested by the documentation.
Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal 3916773564).  No problem here.
Bytes 16-17 (00 00) not sure what this is.
Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”?  But that should be
just one byte.  Must be something else?
Bytes 22-25 (00 00 00 08) is the message payload size (8). This is the value of “n” in
the formula for message length, exactly the length of the “message1” string. No problem here.
Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).  No problem here.
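
One reading under which the 22 bytes add up can be tested with Python's `struct` module. The field names below (`offset`, `attributes`, `key_len`) and the idea that the CRC covers everything from the magic byte onward are assumptions based on the message-set layout in the Kafka 0.8 wire protocol guide, not on the documentation section quoted above, so treat this as a sketch for checking a guess rather than a statement of the format.

```python
import struct
import zlib

# The 34 bytes from the hex dump above, split at the assumed field boundaries.
data = bytes.fromhex(
    "0000000000000000"   # bytes 0-7
    "00000016"           # bytes 8-11
    "e97538bc"           # bytes 12-15
    "00"                 # byte 16
    "00"                 # byte 17
    "ffffffff"           # bytes 18-21
    "00000008"           # bytes 22-25
    "6d65737361676531"   # bytes 26-33
)

# Assumed big-endian layout: offset(8) size(4) crc(4) magic(1) attributes(1) key_len(4)
offset, size, crc, magic, attributes, key_len = struct.unpack_from(">qiIbbi", data, 0)
pos = struct.calcsize(">qiIbbi")  # 22

# Assumption: a key length of -1 means "no key", so no key bytes follow.
key = None
if key_len >= 0:
    key = data[pos:pos + key_len]
    pos += key_len

(value_len,) = struct.unpack_from(">i", data, pos)
pos += 4
value = data[pos:pos + value_len]

# Assumption: CRC-32 over the bytes from the magic byte to the end of the message.
computed_crc = zlib.crc32(data[16:12 + size]) & 0xFFFFFFFF

print("offset", offset, "size", size, "crc", crc)
print("magic", magic, "attributes", attributes, "key_len", key_len, "key", key)
print("value_len", value_len, "value", value)
print("crc matches under assumed coverage:", computed_crc == crc)
```

With this reading, the length field counts 4 (crc) + 1 (magic) + 1 (attributes) + 4 (key length) + 4 (value length) + 8 (payload) = 22 bytes, and `ff ff ff ff` decodes as the signed integer -1.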

Can anyone on the list help me reconcile the documentation to what I see on the disk?  Specifically:
a) what are the first 8 bytes supposed to represent?  
b) the message length field, described as 1+4+n, doesn’t correspond with what I see on
disk.  It looks like 4 (crc) + 2 (??) + 4 (?magic number?) + 4 (payload length) + 8 (n) = 22.
What is the correct formula?
c) why does the CRC appear so early in the message (bytes 12-15)? Shouldn’t the magic value
appear before the CRC?
d) what is the way to interpret bytes 16-21?  is the magic number in here somewhere?  What
else is in this set of bytes?

Thanks
sgg

