kafka-users mailing list archives

From Magnus Edenhill <mag...@edenhill.se>
Subject Re: Trying to understand the format of the LogSegment file.
Date Thu, 03 Dec 2015 21:14:42 GMT
Hi,

Messages are stored on disk in the Kafka (network) protocol format, so if
you have a look at the protocol guide you'll see the pieces start coming
together:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets
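
As a rough sketch of how that layout maps onto the dump quoted below, assuming
the version 0 (magic 0) message format from the protocol guide: each log entry
is an 8-byte offset and a 4-byte message size, followed by the message itself
(crc, magic, attributes, key, value). The names RAW and parse_entry are only
illustrative and are not part of Kafka.

```python
import struct
import zlib

# Bytes copied from the hexdump in the quoted message below.
RAW = bytes.fromhex(
    "00 00 00 00 00 00 00 00 00 00 00 16 e9 75 38 bc "
    "00 00 ff ff ff ff 00 00 00 08 6d 65 73 73 61 67 "
    "65 31"
)


def read_bytes(buf, pos):
    """Read a 4-byte big-endian length followed by that many bytes.

    A length of -1 means "null" (no key / no value).
    Returns (data_or_None, new_position).
    """
    (length,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if length < 0:
        return None, pos
    return buf[pos:pos + length], pos + length


def parse_entry(buf, pos=0):
    """Parse one message-set entry, assuming the v0 (magic 0) layout."""
    # Message-set framing: 8-byte offset, 4-byte message size.
    offset, size = struct.unpack_from(">qi", buf, pos)
    start = pos + 12
    # Message header: 4-byte crc, 1-byte magic, 1-byte attributes.
    crc, magic, attributes = struct.unpack_from(">Ibb", buf, start)
    key, cursor = read_bytes(buf, start + 6)
    value, cursor = read_bytes(buf, cursor)
    # Per the protocol guide, the CRC covers everything after the crc field.
    computed_crc = zlib.crc32(buf[start + 4:start + size]) & 0xFFFFFFFF
    return {
        "offset": offset,
        "size": size,
        "crc": crc,
        "computed_crc": computed_crc,
        "magic": magic,
        "attributes": attributes,
        "key": key,
        "value": value,
        "end": cursor,
    }


if __name__ == "__main__":
    for name, field in parse_entry(RAW).items():
        print(f"{name}: {field}")
```

Read this way, the leading 8 bytes are the message offset, bytes 16-17 are the
magic and attributes bytes, and ff ff ff ff is a key length of -1 (null key).
The size field then works out to 4 + 1 + 1 + 4 + 4 + n = 22 for n = 8.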

Regards,
Magnus



2015-12-03 18:18 GMT+01:00 Steve Graham <sggraham64@gmail.com>:

> I am attempting to understand the details of the content of the log
> segment file in Kafka.
>
> The documentation (http://kafka.apache.org/081/documentation.html#log)
> suggests:
> The exact binary format for messages is versioned and maintained as a
> standard interface so message sets can be transferred between producer,
> broker, and client without recopying or conversion when desirable. This
> format is as follows:
>
> On-disk format of a message
>
> message length : 4 bytes (value: 1+4+n)
> "magic" value  : 1 byte
> crc            : 4 bytes
> payload        : n bytes
>
>
> But I am struggling to map the documentation to what I see on the disk.
>
> I created a topic, named sample-topic, and added one message to it (via
> the console producer).  The message payload was “message1”.
>
> The DumpLogSegments tool shows:
> Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
> Starting offset: 0
> offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec crc: 3916773564
>
> Taking a hex dump of the (only) log file:
> sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
> 00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
> 00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
> 00000020  65 31                                             |e1|
> 00000022
>
> I tried to “reverse engineer” the contents, to see how it corresponds to
> the documentation:
>
> Bytes 0-7 (00 00 00 00 00 00 00 00).  I am not sure what this is, some
> sort of filler?
> Bytes 8-11 (00 00 00 16) seem to be some length field?  Decimal 22, which
> seems to correspond to the length of the entire message, but that is more
> than the 1+4+n suggested by the documentation.
> Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal
> 3916773564).  No problem here.
> Bytes 16-17 (00 00) not sure what this is.
> Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”?  But
> that should be just one byte.  Must be something else?
> Bytes 22-25 (00 00  00 08) is the message payload size (8), this is the
> value of “n” in the formula for message length, exactly the length of the
> “message1” string. No problem here.
> Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).
> No problem here.
>
> Can anyone on the list help me reconcile the documentation to what I see
> on the disk?  Specifically:
> a) what are the first 8 bytes supposed to represent?
> b) the message length field as described as 1+4+n doesn’t correspond with
> what I see on disk.  It looks like 4 (crc) + 2 (??) + 4 (?magic number?) +
> 4 (payload length) + 8 (n).  What is the correct formula?
> c) why does the CRC appear so early in the message (bytes 12-15), shouldn’t
> the magic value appear before the CRC?
> d) what is the way to interpret bytes 16-21?  is the magic number in here
> somewhere?  What else is in this set of bytes?
>
> Thanks
> sgg
>
>
