samza-dev mailing list archives

From Davide Simoncelli <netcelli....@gmail.com>
Subject Re: Number of partitions
Date Fri, 22 May 2015 07:50:57 GMT
Garry,

Thanks for sharing. That was an interesting read!

I have a few questions for you. How did you test your required throughput? Only at the
Kafka level, or with your Samza job as well? Did you measure it with metrics?

Thanks

Davide
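[Editor's note: on the metrics question, Samza 0.9 ships a snapshot reporter that publishes per-task throughput counters to a Kafka topic. A minimal configuration fragment, based on the Samza 0.9 metrics documentation (the stream name is the conventional default; adjust to your setup):]

```properties
# job.properties: enable the Kafka-backed metrics snapshot reporter
metrics.reporters=snapshot
metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
# Kafka topic the metric snapshots are written to ("kafka" here is the system name)
metrics.reporter.snapshot.stream=kafka.metrics
```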

> On 21 May 2015, at 8:50 pm, Garry Turkington <g.turkington@improvedigital.com>
wrote:
> 
> Hi,
> 
> The other variable to think about here is the task-to-container mapping. Each job will
> indeed have one task per input partition in the underlying topic, but you can then spread
> those 500 instances across multiple containers in your YARN grid:
> 
> http://samza.apache.org/learn/documentation/0.9/container/samza-container.html
> 
> I'd also suggest thinking about throughput requirements from both the Kafka and
> Samza perspectives. There's a great blog post from one of the Confluent engineers here:
> 
> http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
> 
> For me, I have Kafka brokers with JBOD disks, so my starting partition count was number
> of brokers * number of disks. I went with that, discovered it wasn't hitting my needed
> throughput, and had to raise it several times. Currently I have around 160 partitions for
> the high-throughput topics (700K msgs/sec) on a 5-broker cluster.
> 
> So over-partitioning is a good thing: it gives additional throughput and more flexible
> growth. But if you ramp that up too soon, you will likely also need to grow your job's
> container count. Look at your throughput requirements and hardware assets, pick a starting
> point, and test your assumptions. If you are like me, you'll find most of them are wrong. :)
> 
> Garry
> 
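[Editor's note: the sizing approach Garry describes (start at brokers * disks, then grow until the measured per-partition throughput covers the target) can be sketched as follows. The rates and the disks-per-broker figure in the example are illustrative, not from the thread:]

```python
import math

def choose_partition_count(brokers, disks_per_broker,
                           target_rate, per_partition_rate):
    """Pick a starting partition count using the heuristic from the thread:
    begin at brokers * disks (one partition per JBOD spindle), then grow
    until the topic can sustain the target message rate."""
    # Floor: one partition per physical disk across the cluster
    floor = brokers * disks_per_broker
    # Partitions needed to reach the target rate at the measured
    # per-partition throughput
    needed = math.ceil(target_rate / per_partition_rate)
    return max(floor, needed)
```

With 5 brokers and a hypothetical 4 disks each, a 700K msgs/sec target, and an assumed ~5K msgs/sec per partition, this gives 140 partitions, the same order of magnitude as the ~160 Garry reports.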
> -----Original Message-----
> From: Lukas Steiblys [mailto:lukas@doubledutch.me] 
> Sent: 21 May 2015 20:20
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
> 
> Each job will get all the partitions, and each task (500 of them) within the job will
> get one partition. So there will be 500 processes working through the log.
> 
> I'd try to figure out what your scaling needs are for the next 2-3 years and then calculate
> your resource requirements accordingly (how many parallel executing tasks you would need).
> If you need to split, it is not trivial, but doable.
> 
> Lukas
> 
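[Editor's note: the task-to-container spread discussed above (500 tasks need not mean 500 JVM processes) is controlled in Samza 0.9 by a single YARN job setting; the container count below is purely illustrative:]

```properties
# job.properties: pack the job's 500 tasks into 20 YARN containers
# (25 tasks per container)
yarn.container.count=20
```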
> -----Original Message-----
> From: Michael Ravits
> Sent: Thursday, May 21, 2015 11:17 AM
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
> 
> Well, since the number of partitions can't be changed after the system starts running,
> I wanted the flexibility to grow a lot without stopping for an upgrade.
> I just wonder what would be a tolerable number for Samza.
> For example, if I'd start with 5 jobs, each will get 100 partitions. Is this reasonable,
> or too much for a single job instance?
> 
> On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lukas@doubledutch.me>
> wrote:
> 
>> 500 is a bit extreme unless you're planning on running the job on some 
>> 200 machines and trying to exploit their full power. I personally run 4 
>> in production for our system processing 100 messages/s and there's 
>> plenty of room to grow.
>> 
>> Lukas
>> 
>> On Thursday, May 21, 2015, Michael Ravits <michaelr524@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I wonder what considerations I need to account for in regard to the 
>>> number of partitions in input topics for Samza.
>>> When testing a topic with 500 partitions with one Samza job, I 
>>> noticed the start-up time to be very long.
>>> Are there any problems that might occur when dealing with this 
>>> number of partitions?
>>> 
>>> Thanks,
>>> Michael
>>> 
>> 
> 
> 

