samza-dev mailing list archives

From Yi Pan <nickpa...@gmail.com>
Subject Re: Avro vs Protocol buffer for Samza output
Date Thu, 19 Nov 2015 01:58:43 GMT
Yeah, this reduced-overhead message format requires an Avro schema
registry, so that you can look up the actual Avro schema via the
schemaId.
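
As a rough illustration of the layout described in the quoted message below (magic byte, then schemaId, then payload), here is a minimal Python sketch. The marker value, the 4-byte big-endian id width, the in-memory `SCHEMA_REGISTRY` dict, and the function names are all assumptions made for this example, not the format used by LinkedIn, Kafka, or Samza. The payload is treated as opaque Avro-encoded bytes.

```python
import struct

# Assumed layout: 1-byte magic marker, 4-byte big-endian schema id, then payload.
MAGIC_AVRO = 0x00               # hypothetical marker value meaning "Avro"
HEADER = struct.Struct(">BI")   # 5 bytes of per-message overhead

# Stand-in for a schema registry service: schema id -> Avro schema.
SCHEMA_REGISTRY = {
    7: {
        "type": "record",
        "name": "PageView",
        "fields": [{"name": "url", "type": "string"}],
    },
}


def frame(schema_id, avro_payload):
    """Prefix already-encoded Avro bytes with the magic byte and schema id."""
    return HEADER.pack(MAGIC_AVRO, schema_id) + avro_payload


def unframe(message):
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_AVRO:
        raise ValueError("not an Avro-framed message")
    return schema_id, message[HEADER.size:]


def schema_for(message):
    """Look up the writer schema for a framed message in the registry."""
    schema_id, payload = unframe(message)
    return SCHEMA_REGISTRY[schema_id], payload
```

With this framing the schema travels once through the registry rather than with every message, so under these assumptions the per-message cost is the 5-byte header plus the Avro-encoded body.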

On Wed, Nov 18, 2015 at 5:53 PM, Selina Tech <swucareer99@gmail.com> wrote:

> Hi, Yi:
>
>     I think I found the answer, quoted below:
>
> "The Kafka message format starts with a magic byte indicating what kind of
> serialization is used for this message. And if this byte indicates Avro,
> you can layout your message as starting with the schemaId and then followed
> by message payload. Upon consumption, you can first get the schemaId, query
> Avro for the schema given the id, and then use schema to deserialize the
> message"
> --http://grokbase.com/t/kafka/users/138mdm6tp3/avro-serialization
>
>
> Thanks again!
> Sincerely,
> Selina
>
> On Wed, Nov 18, 2015 at 5:43 PM, Selina Tech <swucareer99@gmail.com>
> wrote:
>
> > Hi, Yi:
> >      Thanks for your reply. Do you mean that Avro messages have no
> > advantage over Protocol Buffer messages on Kafka, apart from the Avro
> > schema registry?
> >
> >      BTW, do you know how Kafka implements Avro messages? Does each Avro
> > message include the schema or not? The size of Avro messages is a big
> > concern for me now.
> >
> > Sincerely,
> > Selina
> >
> >
> >
> > On Wed, Nov 18, 2015 at 5:29 PM, Yi Pan <nickpan47@gmail.com> wrote:
> >
> >> Hi, Selina,
> >>
> >> Samza's producer/consumer is highly tunable. You can configure it to use
> >> ProtocolBufferSerde class if your messages in Kafka are in ProtocolBuf
> >> format. The use of Avro in Kafka is LinkedIn's choice and does not
> >> necessarily fit others.
> >>
> >> For the sake of "why LinkedIn uses Avro", here is the biggest reason:
> >> LinkedIn uses an Avro schema registry to ensure that producers and
> >> consumers are using compatible Avro schema versions. It is a specific way
> >> of maintaining compatibility between producers and consumers at LinkedIn.
> >> ProtoBuf does not seem to have schema registry functionality and requires
> >> re-compilation to make sure the producer and consumer are compatible on
> >> the wire format of the message.
> >>
> >> If you have other ways to maintain compatibility between producers and
> >> consumers using ProtoBuf, I don't see why you cannot use ProtoBuf in
> >> Samza.
> >>
> >> Best,
> >>
> >> -Yi
> >>
> >> On Wed, Nov 18, 2015 at 3:43 PM, Selina Tech <swucareer99@gmail.com>
> >> wrote:
> >>
> >> > Dear All:
> >> >
> >> >       I need to generate some data with Samza, send it to Kafka, and
> >> > then write it to Parquet files. I was asked why I chose Avro rather
> >> > than Protocol Buffers as my Samza output format for Kafka, since all of
> >> > our current data on Kafka is in Protocol Buffers.
> >> >       I explained that with Avro-encoded messages the encoded size is
> >> > smaller, no extra code compilation is needed, implementation is easier,
> >> > serialization/deserialization is fast, and many languages are
> >> > supported. However, some people believe that an encoded Avro message
> >> > takes as much space as a Protocol Buffer message and that, with the
> >> > schema included, it could be much bigger.
> >> >
> >> >       I am wondering whether there are any other advantages that made
> >> > you choose Avro as your message type on Kafka.
> >> >
> >> > Sincerely,
> >> > Selina
> >> >
> >>
> >
> >
>
