samza-dev mailing list archives

From Yi Pan <nickpa...@gmail.com>
Subject Re: Tracing the Samza+YARN startup process
Date Thu, 20 Jun 2019 00:13:16 GMT
Great and detailed report! Really appreciate it!

-Yi

On Tue, Jun 18, 2019 at 2:37 PM Malcolm McFarland <mmcfarland@cavulus.com>
wrote:

> Just want to follow up on this, for anybody that might be trying to do
> something similar.
>
> There are two things that were getting in the way of us using YARN+Samza on
> ECS: 1) YARN needs to be able to resolve its hostname to something that's
> publicly available; and 2) Samza needs to be able to open connections on
> arbitrary ports in the 30000+ range.
>
> Docker confounds each of these in a different way. For the first, Docker's
> hostname inside of the container is an arbitrary hash, and this is what
> java.net.InetAddress will resolve to. I took Rayman's suggestion and used
> dnsmasq to create a local CNAME mapping inside the container, mapping the
> local "hostname" to one that is publicly available. This should work well
> for any Docker-hosted JVM app relying on java.net.InetAddress.
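
To see what the JVM inside a container actually reports before and after a change like the dnsmasq mapping described above, a minimal sketch along these lines may help. It uses only the standard java.net.InetAddress API. The class name HostnameCheck and the looksLikeContainerId heuristic (12 lowercase hex characters, matching the 717b6f75aaf8 example elsewhere in this thread) are hypothetical helpers for illustration and are not part of Samza or YARN.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Pattern;

public class HostnameCheck {
    // Assumption: a default Docker hostname is a 12-character lowercase hex ID.
    private static final Pattern CONTAINER_ID = Pattern.compile("^[0-9a-f]{12}$");

    /** Hypothetical heuristic: true if the name looks like a bare container ID. */
    public static boolean looksLikeContainerId(String hostname) {
        return hostname != null && CONTAINER_ID.matcher(hostname).matches();
    }

    public static void main(String[] args) {
        try {
            // This is the lookup the thread refers to as java.net.InetAddress resolution.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("hostname:  " + local.getHostName());
            System.out.println("canonical: " + local.getCanonicalHostName());
            System.out.println("address:   " + local.getHostAddress());
            if (looksLikeContainerId(local.getHostName())) {
                System.out.println("warning: hostname looks like a container ID; "
                        + "other hosts probably cannot resolve it");
            }
        } catch (UnknownHostException e) {
            // Thrown when the local hostname cannot be resolved to an address.
            System.out.println("local hostname does not resolve: " + e.getMessage());
        }
    }
}
```

Running this inside the node manager container and comparing the printed name with what other hosts can resolve should show whether the mapping has taken effect.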
>
> Docker also only allows 100 ports to be publicly exposed, and there is no
> configuration option in Samza to specify what the range of ports will be.
> The way we worked around this on ECS was to create an elastic network
> interface (ENI) for each of the node manager containers. Although I can't
> find any documentation on this, I suspect that Fargate does this by
> default, as the whole point of that service is to bypass the restrictions
> placed on containers running on EC2 instances. With the ENI, we no longer
> had to explicitly expose any ports; all ports will be available if the
> security group allows.
>
> As an aside, you might wonder: why not just run these on Fargate? Well,
> Fargate only allows 10GB of storage (this can be extended a small amount
> via an ephemeral mounted volume but seemingly not enough to satisfy YARN's
> VM requirements).
>
> Hth, and thanks for everybody's patience,
>
> Malcolm McFarland
> Cavulus
>
>
> This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> unauthorized or improper disclosure, copying, distribution, or use of the
> contents of this message is prohibited. The information contained in this
> message is intended only for the personal and confidential use of the
> recipient(s) named above. If you have received this message in error,
> please notify the sender immediately and delete the original message.
>
>
> On Fri, May 31, 2019 at 3:08 PM rayman preet <rayman7718@gmail.com> wrote:
>
> > Apart from /etc/hosts and /bin/hostname, the only other relevant place might
> > be to modify values in /etc/resolv.conf to point to, e.g., a dnsmasq
> > instance.
> >
> > On Fri, May 31, 2019 at 2:43 PM Malcolm McFarland <mmcfarland@cavulus.com>
> > wrote:
> >
> > > Hey Rayman,
> > >
> > > The ops group and I went through the configuration today and observed the
> > > YARN containers as they were coming up. We seem to have found the root of
> > > the problem, and I'm putting this out there for anybody else that's trying
> > > to do something similar on AWS ECS:
> > >
> > > The ECS container instances set their hostname to the container ID on
> > > startup (e.g. 717b6f75aaf8), and this looks like it's interfering with the
> > > YARN container startup process. This *seems* to be corroborated in that
> > > containers that start on the same host as their AM look to be starting fine
> > > (i.e. they can locally resolve their IP address correctly), but containers
> > > starting on other hosts don't seem to be. We were *not* having this problem
> > > on Fargate, and my only guess is that, given Fargate's intended use case as
> > > a replicated-services-in-the-cloud environment, AWS sets the hostname for
> > > Fargate-bound Docker containers on launch (e.g.
> > > ip-10-#-#-#.us-west-#.internal.local or whatever). (As a side note, we
> > > probably would have stuck with Fargate and not run into this problem, but
> > > Fargate instances are only allowed 10GB of disk space, and this wasn't
> > > enough for YARN's VM requirements.)
> > >
> > > I've been fishing around for a way to get Samza to resolve the hostname to
> > > something more publicly available. I've thus far tried a) changing the
> > > /etc/hosts file, and b) replacing the /bin/hostname binary in the container
> > > with a static script, but neither of these options seems to have an effect
> > > on Java's DNS resolution. Two further options I can think of are:
> > >
> > > - find some place in the Samza configuration where the hostname can be set
> > > explicitly; or
> > > - change just the right piece of information in the system so that
> > > java.net.InetAddress will resolve the localhost to something other than
> > > what's returned from /bin/hostname (I'm guessing it uses gethostname() on
> > > Ubuntu, could be wrong).
> > >
> > > Any ideas, anybody?
> > >
> > > Cheers,
> > > Malcolm McFarland
> > > Cavulus
> > >
> > >
> > > On Fri, May 31, 2019 at 9:27 AM rayman preet <rayman7718@gmail.com> wrote:
> > >
> > > > Yes, I think your hunch is right. Each container queries the AM over HTTP
> > > > to obtain the jobModel that it is supposed to run. The AM runs an HTTP
> > > > server, usually on a dynamically allocated free port on the machine it's
> > > > running on. So it's possible that a firewall rule blocks the container
> > > > when it tries to reach this port on the AM's machine?
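
One way to test the firewall hypothesis above is to attempt a plain TCP connection from the container's host to the AM's host and port. The following is a minimal sketch under stated assumptions: the class name PortProbe and its methods are hypothetical, it only checks that a TCP connection can be opened (it does not speak the AM's HTTP protocol), and the host and port to probe have to be read from the AM's own logs or the YARN UI. Binding to port 0 in probeLoopback shows what "dynamically allocated free port" means in practice: the operating system picks the port, so it cannot be known in advance for a firewall rule.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: all count as unreachable.
            return false;
        }
    }

    /** Binds an OS-assigned port (port 0) on this machine and probes it locally. */
    public static boolean probeLoopback() {
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort(); // the port the OS actually chose
            return canConnect("127.0.0.1", port, 1000);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 2) {
            // Usage: java PortProbe <am-host> <am-port>
            String host = args[0];
            int port = Integer.parseInt(args[1]);
            System.out.println(host + ":" + port + " reachable: "
                    + canConnect(host, port, 3000));
        } else {
            System.out.println("loopback probe on an OS-assigned port: " + probeLoopback());
        }
    }
}
```

If the probe succeeds from the AM's own host but fails from another node, that points toward a network rule or hostname resolution problem rather than the application itself.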
> > > >
> > > > --
> > > > thanks
> > > > rayman
> > > >
> > > > On Thu, May 30, 2019 at 5:30 PM Malcolm McFarland <mmcfarland@cavulus.com>
> > > > wrote:
> > > >
> > > > > Thanks for the image, appreciate you taking the effort to do that! I'm
> > > > > still hitting this wall. The AM will launch the container, the container
> > > > > will go from "accepted" to "running", but there will be no output from the
> > > > > container (I'm piping all of the Samza, org.apache, org.kafka, and our own
> > > > > application's logging output to a Kafka topic). During these periods, the
> > > > > container will hang out at ~100MB/8GB memory usage and stall. There's no
> > > > > error output when this happens; it just kind of stops. My suspicion is that
> > > > > our Ops group has a firewall rule up that's interfering with this, or maybe
> > > > > just isn't white-listing a port correctly, and if I could identify where
> > > > > the application is stalling, it'd probably help to narrow down the
> > > > > possibilities.
> > > > >
> > > > > Cheers,
> > > > > Malcolm McFarland
> > > > > Cavulus
> > > > >
> > > > >
> > > > > On Thu, May 30, 2019 at 1:39 PM rayman preet <rayman7718@gmail.com> wrote:
> > > > >
> > > > > > I uploaded the image here:
> > > > > > https://www.dropbox.com/s/rv57v165ysp12c5/samza%20flow.png?dl=0
> > > > > >
> > > > > > Are you still running into this issue?
> > > > > > Is there anything in the container's log that shows any exceptions/errors?
> > > > > >
> > > > > > On Wed, May 22, 2019 at 10:15 PM Malcolm McFarland <mmcfarland@cavulus.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey rayman,
> > > > > > >
> > > > > > > What it looks like is that the AM has started and the container has
> > > > > > > started, but then the output stops. For example, here are the last
> > > > > > > messages I see in the Samza logs:
> > > > > > >
> > > > > > > 2019-05-23T05:10:45.048Z        INFO    Making a request for ANY_HOST
> > > > > > > 2019-05-23T05:10:45.057Z        INFO    Starting the container allocator thread
> > > > > > > 2019-05-23T05:10:47.098Z        INFO    Received new token for : <valid_host>:8032
> > > > > > > 2019-05-23T05:10:47.102Z        INFO    Container allocated from RM on <same_valid_host>
> > > > > > > 2019-05-23T05:10:47.105Z        INFO    Container allocated from RM on <same_valid_host>
> > > > > > >
> > > > > > > At this point, it seems to stall, and no more output is produced.
> > > > > > >
> > > > > > > Also, I couldn't see your diagram (it's possible my company's email
> > > > > > > filters attachments); can I see that on the web anywhere?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Malcolm
> > > > > > >
> > > > > > > On Wed, May 22, 2019 at 4:30 PM rayman preet <rayman7718@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Malcolm,
> > > > > > > >
> > > > > > > > This figure (attached) gives an overview of the flow. Is
> > > > > > > > this something you were looking for?
> > > > > > > >
> > > > > > > > Also, by "don't fully start up" do you mean that
> > > > > > > > applications are missing some containers (but the ApplicationMaster
> > > > > > > > is running)? Or is the application missing entirely?
> > > > > > > >
> > > > > > > > --
> > > > > > > > thanks
> > > > > > > > rayman
> > > > > > > > [image: Samza Job Launch Sequence.png]
> > > > > > > >
> > > > > > > > On Tue, May 21, 2019 at 3:58 PM Malcolm McFarland <mmcfarland@cavulus.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >> Hey Folks,
> > > > > > > > >>
> > > > > > > > >> I'm still trying to pin down why these applications are sometimes not
> > > > > > > > >> starting. Everything looks fine in the YARN web UI and in the
> > > > > > > > >> immediately available logs, but the applications don't always fully
> > > > > > > > >> start up. Does anybody have a rundown about how to trace the Samza
> > > > > > > > >> startup process on a YARN cluster, from Accepted status, to
> > > > > > > > >> localization, to the application master startup, to the actual
> > > > > > > > >> application's startup?
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >> Malcolm
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Malcolm McFarland
> > > > > > > >> Cavulus
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> This correspondence is from HealthPlanCRM, LLC,
d/b/a
> Cavulus.
> > > Any
> > > > > > > >> unauthorized or improper disclosure, copying,
distribution,
> or
> > > use
> > > > > of
> > > > > > > >> the contents of this message is prohibited. The
information
> > > > > contained
> > > > > > > >> in this message is intended only for the personal
and
> > > confidential
> > > > > use
> > > > > > > >> of the recipient(s) named above. If you have received
this
> > > message
> > > > > in
> > > > > > > >> error, please notify the sender immediately and
delete the
> > > > original
> > > > > > > >> message.
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > thanks
> > > > > > > > rayman
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Malcolm McFarland
> > > > > > > Cavulus
> > > > > > > 1-800-760-6915
> > > > > > > mmcfarland@cavulus.com
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > thanks
> > > > > > rayman
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > thanks
> > > > rayman
> > > >
> > >
> >
> >
> > --
> > thanks
> > rayman
> >
>
