kafka-users mailing list archives

From Jason Rosenberg <...@squareup.com>
Subject Re: [DISCUSSION] adding the serializer api back to the new java producer
Date Wed, 03 Dec 2014 05:23:42 GMT
In our case, we use protocol buffers for all messages, and these have
simple serialization/deserialization builtin to the protobuf libraries
(e.g. MyProtobufMessage.toByteArray()).  Also, we often produce/consume
messages without conversion to/from protobuf Objects (e.g. in cases where
we are just forwarding messages on to other topics, or if we are consuming
directly to a binary blob store like HDFS).  There's a huge efficiency gain in
not synthesizing new Objects unnecessarily.

Thus, it's nice to deal only with bytes directly in all messages, and keep
things simple.  Having to dummy in a default, generically parameterized,
no-op serializer (and paying the overhead of that extra no-op method call)
seems unnecessary.

I'd suggest that maybe it could work seamlessly either way (which it
probably does now, for the case where no serializer is provided, though I'm
not sure whether the JIT will efficiently elide the call to the no-op
serializer).  Alternatively, I do think it's important to preserve the
efficiency of sending raw bytes directly, so if necessary, maybe expose
both apis (one of which explicitly bypasses any serialization).
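
For illustration, a minimal sketch of the kind of no-op serializer being
discussed. It assumes a local copy of the Serializer<T> interface from Jun's
proposal quoted below, without the Configurable parent so that it compiles on
its own; the names ProposedSerializer, PassThroughSerializer and
PassThroughDemo are hypothetical and are not part of Kafka. It only shows that
the pass-through hands back the same array without copying; it says nothing
about whether the JIT inlines the call.

```java
// Hypothetical local copy of the proposed interface (Configurable omitted).
interface ProposedSerializer<T> {
    byte[] serialize(String topic, T data, boolean isKey);

    void close();
}

// No-op serializer: the caller already holds bytes, so return them untouched.
final class PassThroughSerializer implements ProposedSerializer<byte[]> {
    @Override
    public byte[] serialize(String topic, byte[] data, boolean isKey) {
        return data; // same reference, no copy, no new Objects
    }

    @Override
    public void close() {
        // nothing to release
    }
}

final class PassThroughDemo {
    // Stand-in for what a producer would do before putting bytes on the wire.
    static byte[] viaSerializer(ProposedSerializer<byte[]> serializer, byte[] payload) {
        return serializer.serialize("forwarded-topic", payload, false);
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        byte[] out = viaSerializer(new PassThroughSerializer(), payload);
        System.out.println(out == payload); // identity check: same array instance
    }
}
```

The only remaining cost in this sketch is the extra interface call per
message, which is the overhead questioned above.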

Finally, I've wondered in the past about enabling some sort of streaming
serialization, whereby you hook up a producer to a long-lived stream
class, which could integrate compression in line, and allow more control of
the pipeline.  The stream would implement an iterator to get the next
serialized message, etc.  For me, something like this might be a reason to
have a serialization/deserialization abstraction built into the
producer/consumer APIs.
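
One way to picture that streaming idea is the sketch below. It is only an
assumption about what such a class could look like: SerializingStream is a
hypothetical name, the encoder function and the per-message gzip step are
illustrative choices, and nothing here corresponds to an existing Kafka API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.function.Function;
import java.util.zip.GZIPOutputStream;

// Hypothetical long-lived stream: pulls typed records from a source iterator
// and yields the next serialized (optionally compressed) message on demand.
final class SerializingStream<T> implements Iterator<byte[]> {
    private final Iterator<T> source;
    private final Function<T, byte[]> encoder;
    private final boolean compress;

    SerializingStream(Iterator<T> source, Function<T, byte[]> encoder, boolean compress) {
        this.source = source;
        this.encoder = encoder;
        this.compress = compress;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public byte[] next() {
        byte[] raw = encoder.apply(source.next());
        return compress ? gzip(raw) : raw;
    }

    // Compresses a single message; the stream is closed before the buffer is read.
    private static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

A producer built around such a stream would call next() whenever it is ready
to send, so serialization and compression happen lazily in one place.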

But if I have a vote, I'd be in favor of keeping the api simple and have it
take bytes directly.

Jason

On Tue, Dec 2, 2014 at 9:50 PM, Jan Filipiak <Jan.Filipiak@trivago.com>
wrote:

> Hello Everyone,
>
> I would very much appreciate it if someone could provide me a real-world
> example where it is more convenient to implement the serializers instead of
> just making sure to provide bytearrays.
>
> The code we came up with explicitly avoids the serializer api. I think it
> is common understanding that if you want to transport data you need to have
> it as a bytearray.
>
> If at all, I personally would like to have a serializer interface that
> takes the same types as the producer:
>
> public interface Serializer<K,V> extends Configurable {
>     public byte[] serializeKey(K data);
>     public byte[] serializeValue(V data);
>     public void close();
> }
>
> This would avoid long serialize implementations with branches like
> "switch(topic)" or "if(isKey)". Further, a serializer per topic makes more
> sense in my opinion. It feels natural to have a one-to-one relationship
> from types to topics, or at least only a few partitions per type. But as we
> inherit the type from the producer, we would have to create many producers.
> This would create additional unnecessary connections to the brokers. With
> the serializers we create a one-type-to-all-topics relationship, and the
> only type that satisfies that is the bytearray or Object. Am I missing
> something here? As said in the beginning, I would like to see the use case
> that really benefits from using the serializers. I think in theory they
> sound great, but they cause real practical issues that may lead users to
> wrong decisions.
>
> -1 for putting the serializers back in.
>
> Looking forward to replies that can show me the benefit of serializers and
> especially how the Type => topic relationship can be handled nicely.
>
> Best
> Jan
>
>
>
>
> On 25.11.2014 02:58, Jun Rao wrote:
>
>> Hi, Everyone,
>>
>> I'd like to start a discussion on whether it makes sense to add the
>> serializer api back to the new java producer. Currently, the new java
>> producer takes a byte array for both the key and the value. While this api
>> is simple, it pushes the serialization logic into the application. This
>> makes it hard to reason about what type of data is being sent to Kafka and
>> also makes it hard to share an implementation of the serializer. For
>> example, to support Avro, the serialization logic could be quite involved
>> since it might need to register the Avro schema in some remote registry
>> and maintain a schema cache locally, etc. Without a serialization api,
>> it's impossible to share such an implementation so that people can easily
>> reuse. We sort of overlooked this implication during the initial
>> discussion of the producer api.
>>
>> So, I'd like to propose an api change to the new producer by adding back
>> the serializer api similar to what we had in the old producer. Specifically,
>> the proposed api changes are the following.
>>
>> First, we change KafkaProducer to take generic types K and V for the key
>> and the value, respectively.
>>
>> public class KafkaProducer<K,V> implements Producer<K,V> {
>>
>>      public Future<RecordMetadata> send(ProducerRecord<K,V> record,
>>                                         Callback callback);
>>
>>      public Future<RecordMetadata> send(ProducerRecord<K,V> record);
>> }
>>
>> Second, we add two new configs, one for the key serializer and another for
>> the value serializer. Both serializers will default to the byte array
>> implementation.
>>
>> public class ProducerConfig extends AbstractConfig {
>>
>>      .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
>>              "org.apache.kafka.clients.producer.ByteArraySerializer",
>>              Importance.HIGH, KEY_SERIALIZER_CLASS_DOC)
>>      .define(VALUE_SERIALIZER_CLASS_CONFIG, Type.CLASS,
>>              "org.apache.kafka.clients.producer.ByteArraySerializer",
>>              Importance.HIGH, VALUE_SERIALIZER_CLASS_DOC);
>> }
>>
>> Both serializers will implement the following interface.
>>
>> public interface Serializer<T> extends Configurable {
>>      public byte[] serialize(String topic, T data, boolean isKey);
>>
>>      public void close();
>> }
>>
>> This is more or less the same as what's in the old producer. The slight
>> differences are (1) the serializer now only requires a parameter-less
>> constructor; (2) the serializer has a configure() and a close() method for
>> initialization and cleanup, respectively; (3) the serialize() method
>> additionally takes the topic and an isKey indicator, both of which are
>> useful for things like schema registration.
>>
>> The detailed changes are included in KAFKA-1797. For completeness, I also
>> made the corresponding changes for the new java consumer api as well.
>>
>> Note that the proposed api changes are incompatible with what's in the
>> 0.8.2 branch. However, if those api changes are beneficial, it's probably
>> better to include them now in the 0.8.2 release, rather than later.
>>
>> I'd like to discuss mainly two things in this thread.
>> 1. Do people feel that the proposed api changes are reasonable?
>> 2. Are there any concerns about including the api changes in the 0.8.2
>> final release?
>>
>> Thanks,
>>
>> Jun
>>
>>
>
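
To make Jan's concern about "switch(topic)" and "if(isKey)" branches concrete,
here is a minimal sketch of what a single serializer shared across topics
might end up looking like under the serialize(topic, data, isKey) signature
from Jun's proposal. Everything in it is assumed for illustration: the
TopicAwareSerializer interface is a local stand-in without Configurable, and
the topic names "clicks" and "names" and their encodings are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical local stand-in for the proposed interface (Configurable omitted).
interface TopicAwareSerializer<T> {
    byte[] serialize(String topic, T data, boolean isKey);

    void close();
}

// One producer means one value type, so a producer that writes to several
// topics is pushed toward Object plus runtime branching and casts.
final class BranchingSerializer implements TopicAwareSerializer<Object> {
    @Override
    public byte[] serialize(String topic, Object data, boolean isKey) {
        if (isKey) {
            // Assumption: all keys are encoded as their string form.
            return data.toString().getBytes(StandardCharsets.UTF_8);
        }
        switch (topic) {
            case "clicks": // made-up topic carrying Long counters
                return ByteBuffer.allocate(Long.BYTES).putLong((Long) data).array();
            case "names": // made-up topic carrying String values
                return ((String) data).getBytes(StandardCharsets.UTF_8);
            default:
                throw new IllegalArgumentException("no encoding registered for topic " + topic);
        }
    }

    @Override
    public void close() {
        // nothing to release
    }
}
```

The casts are only checked at runtime, which is the loss of type safety the
one-type-to-all-topics relationship implies.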
