samza-dev mailing list archives

From: Chris Riccomini <criccom...@linkedin.com.INVALID>
Subject: Re: Requesting Yarn resources for Kafka/topic/partition + samza/container locality
Date: Wed, 17 Dec 2014 00:04:41 GMT

Hey Bart,

We've discussed collocating SamzaContainers with Kafka brokers in:

  https://issues.apache.org/jira/browse/SAMZA-335

This feature is not currently supported in Samza. We were never able to
get YARN to honor host-level resource requests. This, combined with the
subtle complexity of the problem, led us to back off on support for it for
the time being.

In theory, the approach should reduce network IO. In practice, the benefit
of collocation depends heavily on the structure of your topics (partition
count) and on where your jobs write their output. There are details on
this in the JIRA above.

As an aside, even without direct collocation support in Samza, we ran our
YARN NMs on the same machines as the Kafka brokers, which gave us some
serendipitous network IO savings, since containers would randomly end up
on brokers from which they were consuming. Recently, we moved our NMs off
of the boxes because some of our stateful Samza jobs were fighting for
page cache with the Kafka brokers, which are very dependent on page cache.
If your grid is small enough, the "random" approach might be good
enough. If you have just 3 brokers/NMs, putting them on the same 3 boxes
will cut NIC usage by roughly 1/3rd (assuming an even partition leader
distribution), since about a third of reads would stay on the local box.
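
A back-of-the-envelope check of the 1/3rd figure, as a minimal Python
sketch. It assumes partition leaders are spread round-robin across the
hosts and that each consuming container lands on a uniformly random host.
The function name and its parameters are made up for illustration and are
not part of Samza, Kafka, or YARN.

```python
from fractions import Fraction


def local_read_fraction(num_hosts: int, num_partitions: int) -> Fraction:
    """Fraction of (partition, container host) placements where the
    container sits on the partition leader's host.

    Hypothetical helper: assumes leaders are assigned round-robin and
    that container placement is uniformly random across hosts.
    """
    if num_hosts < 1 or num_partitions < 1:
        raise ValueError("need at least one host and one partition")
    local = 0
    total = 0
    for partition in range(num_partitions):
        leader_host = partition % num_hosts  # even leader distribution
        for container_host in range(num_hosts):  # every possible placement
            total += 1
            if container_host == leader_host:
                local += 1  # read is served by the broker on the same box
    return Fraction(local, total)


if __name__ == "__main__":
    print(local_read_fraction(3, 12))
```

Under these assumptions the local share is 1/N for N hosts, so it is 1/3
on a 3-box grid and shrinks to 1/10 on a 10-box grid, which is why the
"random" approach only helps much when the grid is small.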

Cheers,
Chris

On 12/16/14 2:33 PM, "Bart Wyatt" <bart.wyatt@dsvolition.com> wrote:

>We are in the early evaluation period for Samza in a relatively resource
>constrained environment.  One of the things we cannot currently expect is
>more than a 1 gigabit local network which our models indicate we will
>saturate in a naïve case.
>
>One solution we are considering would be that all of our highest
>throughput jobs, the ones that consume directly from and filter high
>throughput topics, would be co-located on the same nodes running the
>brokers for the applicable partition of those topics.  The idea being we
>would not have to escape loopback to deliver the messages and that the
>output bandwidth of those jobs would be significantly smaller and more
>manageable.
>
>It seems like this is something the ApplicationMaster would have to
>coordinate with YARN and very much resembles how YARN will allocate
>compute resources near HDFS-stored-data.  Is there anything in
>ApplicationMaster that would allow us to do this today?  Or would the
>proper approach be to run those jobs directly outside of a YARN grid and
>have the YARN Jobs read from the products of such direct jobs?
>
>-Bart