kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrienne Kole <adrienneko...@gmail.com>
Subject Re: Kafka Streams dynamic partitioning
Date Wed, 05 Oct 2016 20:37:20 GMT
Hi,

@Ali     IMO, Yes. That is the job of kafka server to assign kafka
instances partition(s) to process. Each instance can process more than one
partition but one partition cannot be processed by more than one instance.

@Michael, Thanks for reply.
>Rather, pick the number of partitions in a way that matches your needs to
process the data in parallel
I think this should be ' pick number of partitions that matches max number
of possible keys in stream to be partitioned '.
At least in my usecase , in which I am trying to partition stream by key
and make windowed aggregations, if there are less number of topic
partitions than possible keys,  then application will not work correctly.

That is, if the number of topic partitions is less than possible stream
keys, then different keyed stream tuples will be assigned to same topic.
That was the problem that I was trying to solve and it seems the only
solution is to estimate max number of possible keys and assign accordingly.

Thanks
Adrienne





On Wed, Oct 5, 2016 at 5:55 PM, Ali Akhtar <ali.rac200@gmail.com> wrote:

> > It's often a good
> idea to over-partition your topics.  For example, even if today 10 machines
> (and thus 10 partitions) would be sufficient, pick a higher number of
> partitions (say, 50) so you have some wiggle room to add more machines
> (11...50) later if need be.
>
> If you create e.g 30 partitions, but only have e.g 5 instances of your
> program, all on the same consumer group, all using kafka streams to consume
> the topic, do you still receive all the data posted to the topic, or will
> you need to have the same instances of the program as there are partitions?
>
> (If you have 1 instance, 30 partitions, will the same rules apply, i.e it
> will receive all data?)
>
> On Wed, Oct 5, 2016 at 8:52 PM, Michael Noll <michael@confluent.io> wrote:
>
> > > So, in this case I should know the max number of possible keys so that
> > > I can create that number of partitions.
> >
> > Assuming I understand your original question correctly, then you would
> not
> > need to do/know this.  Rather, pick the number of partitions in a way
> that
> > matches your needs to process the data in parallel (e.g. if you expect
> that
> > you require 10 machines in order to process the incoming data, then you'd
> > need 10 partitions).  Also, as a general recommendation:  It's often a
> good
> > idea to over-partition your topics.  For example, even if today 10
> machines
> > (and thus 10 partitions) would be sufficient, pick a higher number of
> > partitions (say, 50) so you have some wiggle room to add more machines
> > (11...50) later if need be.
> >
> >
> >
> > On Wed, Oct 5, 2016 at 9:34 AM, Adrienne Kole <adriennekole1@gmail.com>
> > wrote:
> >
> > > Hi Guozhang,
> > >
> > > So, in this case I should know the max number of possible keys so that
> I
> > > can create that number of partitions.
> > >
> > > Thanks
> > >
> > > Adrienne
> > >
> > > On Wed, Oct 5, 2016 at 1:00 AM, Guozhang Wang <wangguoz@gmail.com>
> > wrote:
> > >
> > > > By default the partitioner will use murmur hash on the key and mode
> on
> > > > current num.partitions to determine which partitions to go to, so
> > records
> > > > with the same key will be assigned to the same partition. Would that
> be
> > > OK
> > > > for your case?
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Tue, Oct 4, 2016 at 3:00 PM, Adrienne Kole <
> adriennekole1@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > From Streams documentation, I can see that each Streams instance
is
> > > > > processing data independently (from other instances), reads from
> > topic
> > > > > partition(s) and writes to specified topic.
> > > > >
> > > > >
> > > > > So here, the partitions of topic should be determined beforehand
> and
> > > > should
> > > > > remain static.
> > > > > In my usecase I want to create partitioned/keyed (time) windows and
> > > > > aggregate them.
> > > > > I can partition the incoming data to specified topic's partitions
> and
> > > > each
> > > > > Stream instance can do windowed aggregations.
> > > > >
> > > > > However, if I don't know the number of possible keys (to
> partition),
> > > then
> > > > > what should I do?
> > > > >
> > > > > Thanks
> > > > > Adrienne
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message