kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guozhang Wang <wangg...@gmail.com>
Subject Re: Consumer Parallelism
Date Tue, 12 Aug 2014 16:22:24 GMT
I see your question now. You may want to read this FAQ:

https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified
?


On Tue, Aug 12, 2014 at 8:11 AM, Mingtao Zhang <mail2mingtao@gmail.com>
wrote:

> Hi Guozhang,
>
> I think what I am looking for is the real 'randomness' when producer write
> to the partitions. Based on my log, through a long time period, only one
> partition got the write, while the other side, only one consumer is active.
> In my case the consumer is slow, so when it comes back for the next
> message, the whole pipeline is slowed down.
>
> The 'round robin' works for my case. Is it Email a good thread to follow?
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCANZY1Phu8_QZqHCw-ivGp=MXMZPZE9+2VJXRLewZzEbjoR0Ynw@mail.gmail.com%3E
>
> Best Regards,
> Mingtao
>
>
> On Tue, Aug 12, 2014 at 7:55 AM, Mingtao Zhang <mail2mingtao@gmail.com>
> wrote:
>
> > Hi Guozhang,
> >
> > Thank you!
> >
> > Could I say the consumer 'take turns to consume' is resulted by the
> > correspond partition got the 'message write'?
> >
> > The problem I am facing is my 'enrichment' (getting more data based on
> raw
> > data) consumer took too much time to complete one message consumption. To
> > explore more parallel, could I say my only choice is 'decouple consumer
> > consumption with enrichment'?
> >
> > Mingtao Sent from iPhone
> >
> > > On Aug 12, 2014, at 1:10 AM, Guozhang Wang <wangguoz@gmail.com> wrote:
> > >
> > > Hello Mingtao,
> > >
> > > The partition will not be re-assigned to other consumers unless the
> > current
> > > consumer fails, so the behavior you described will not be expected.
> > >
> > > Guozhang
> > >
> > >
> > > On Mon, Aug 11, 2014 at 6:27 PM, Mingtao Zhang <mail2mingtao@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Guozhang,
> > >>
> > >> I do have another Email talking about Partitions per topic. I paste it
> > >> within this Email.
> > >>
> > >> I am expecting those consumers will work concurrently. The behavior I
> > >> observed here is consumer thread-1 will work a while, then thread-3
> will
> > >> work, then thread-0 ..., is it normal?
> > >>
> > >> version is 2.2.0.
> > >>
> > >> Best Regards,
> > >> Mingtao
> > >>
> > >>> On Wed, Jul 23, 2014 at 7:57 PM, Guozhang Wang <wangguoz@gmail.com>
> > wrote:
> > >>>
> > >>> num.partitions is only used as a default value when the createTopic
> > >> command
> > >>> does not specify the num.partitions or it is automatically created.
> In
> > >> your
> > >>> case since you always use its value in the createTopic you will
> always
> > >> can
> > >>> one partition. Try change your code to sth. like:
> > >>>
> > >>>        String[] args = new String[]{
> > >>>            "--zookeeper", config.getString("zookeeper"),
> > >>>            "--topic", config.getString("topic"),
> > >>>            "--replica", config.getString("replicas"),
> > >>>            "--partition", "8"
> > >>>        };
> > >>>
> > >>>        CreateTopicCommand.main(args);
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Jul 23, 2014 at 4:38 PM, Mingtao Zhang <
> mail2mingtao@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Hi All,
> > >>>>
> > >>>> In kafka.properties, I put (forgot to change):
> > >>>>
> > >>>> num.partitions=1
> > >>>>
> > >>>> While I create topics programatically:
> > >>>>
> > >>>>        String[] args = new String[]{
> > >>>>            "--zookeeper", config.getString("zookeeper"),
> > >>>>            "--topic", config.getString("topic"),
> > >>>>            "--replica", config.getString("replicas"),
> > >>>>            "--partition", config.getString("partitions")
> > >>>>        };
> > >>>>
> > >>>>        CreateTopicCommand.main(args);
> > >>>>
> > >>>> The performance engineer told me only one consumer thread is
> actively
> > >>>> working even I have 4 consumer threads started (could see when
> > >> debugging
> > >>> or
> > >>>> in thread dump); and 4 partitions configured from the args.
> > >>>>
> > >>>> It seems that num.partitions is still controlling the parallelism.
> Do
> > I
> > >>>> need to change this num.partitions accordingly? Could I remove
it?
> > What
> > >>> is
> > >>>> I have different parallel requirement for different topic?
> > >>>>
> > >>>> Thank you in advance!
> > >>>>
> > >>>> Best Regards,
> > >>>> Mingtao
> > >>
> > >>
> > >>> On Mon, Aug 11, 2014 at 7:37 PM, Guozhang Wang <wangguoz@gmail.com>
> > wrote:
> > >>>
> > >>> Mingtao,
> > >>>
> > >>> How many partitions of the consumed topic has? Basically the data is
> > >>> distributed per-partition, and hence if the number of consumers is
> > larger
> > >>> than the number of partitions, some consumers will not get any data.
> > >>>
> > >>> Guozhang
> > >>>
> > >>>
> > >>> On Mon, Aug 11, 2014 at 3:29 PM, Mingtao Zhang <
> mail2mingtao@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Is it anyhow related to the issue?
> > >>>>
> > >>>> WARN No previously checkpointed highwatermark value found for topic
> > RAW
> > >>>> partition 0. Returning 0 as the highwatermark
> > >>>> (kafka.server.HighwaterMarkCheckpoint)
> > >>>>
> > >>>> Mingtao
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> -- Guozhang
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>



-- 
-- Guozhang

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message