samza-dev mailing list archives

From Malcolm McFarland <mmcfarl...@cavulus.com>
Subject Re: Tracing the Samza+YARN startup process
Date Fri, 31 May 2019 21:42:53 GMT
Hey Rayman,

The ops group and I went through the configuration today and observed the
YARN containers as they were coming up. We seem to have found the root of
the problem, and I'm putting this out there for anybody else that's trying
to do something similar on AWS ECS:

The ECS container instances set their hostname to the container ID on
startup (e.g. 717b6f75aaf8), and this looks like it's interfering with the
YARN container startup process. This *seems* to be corroborated by the fact
that containers that start on the same host as their AM look to be starting
fine (i.e. they can resolve their IP address locally), but containers
starting on other hosts don't seem to be. We were *not* having this problem
on Fargate, and my only guess is that, given Fargate's intended use case as
a replicated-services-in-the-cloud environment, AWS sets the hostname for
Fargate-bound Docker containers on launch (e.g.
ip-10-#-#-#.us-west-#.internal.local or whatever). (As a side note, we
probably would have stuck with Fargate and not run into this problem, but
Fargate instances are only allowed 10GB of disk space, and this wasn't
enough for YARN's VM requirements.)
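
For anybody who wants to check the same thing from inside one container
against another host, here is a minimal sketch. It only tests whether a
hostname resolves and whether a TCP port on it accepts connections. The class
name, method names, and the host/port arguments are placeholders of my own,
not anything from Samza or YARN.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical helper: not part of Samza or YARN.
public class ReachabilityProbe {

    // True if the JVM can resolve the given hostname to an address.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // True if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers unresolvable hosts, refused connections, and timeouts.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: ReachabilityProbe <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println("resolves: " + canResolve(host));
        System.out.println("connects: " + canConnect(host, port, 2000));
    }
}
```

Running it with a container-ID hostname from a different host should show
whether the name resolves there at all, which is the thing we suspect is
failing.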

I've been fishing around for a way to get Samza to resolve the hostname to
something more publicly available. I've thus far tried a) changing the
/etc/hosts file, and b) replacing the /bin/hostname binary in the container
with a static script, but neither of these options seems to have an effect
on Java's DNS resolution. Two further options I can think of are:

- find some place in the Samza configuration where the hostname can be set
explicitly; or
- change just the right piece of information in the system so that
java.net.InetAddress will resolve the localhost to something other than
what's returned from /bin/hostname (I'm guessing it uses gethostname() on
Ubuntu, could be wrong).
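
A quick way to see what the JVM itself reports for the local host, as a
minimal sketch. LocalHostCheck and describe are hypothetical names; only the
java.net.InetAddress calls are standard. It doesn't show how Samza picks its
hostname, just what InetAddress.getLocalHost() returns in that environment.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic: prints what the JVM resolves as the local host.
public class LocalHostCheck {

    // Formats an address as "name -> ip", flagging loopback addresses.
    static String describe(InetAddress addr) {
        return addr.getHostName() + " -> " + addr.getHostAddress()
                + (addr.isLoopbackAddress() ? " (loopback)" : "");
    }

    public static void main(String[] args) {
        try {
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("getLocalHost(): " + describe(local));
            System.out.println("canonical name: " + local.getCanonicalHostName());
        } catch (UnknownHostException e) {
            // Thrown when the local hostname cannot be resolved to an address.
            System.out.println("local hostname does not resolve: " + e.getMessage());
        }
    }
}
```

Running this inside the container before and after a change (e.g. the
/etc/hosts edit) should show whether the change is visible to Java at all.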

Anybody have ideas?

Cheers,
Malcolm McFarland
Cavulus


This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
unauthorized or improper disclosure, copying, distribution, or use of the
contents of this message is prohibited. The information contained in this
message is intended only for the personal and confidential use of the
recipient(s) named above. If you have received this message in error,
please notify the sender immediately and delete the original message.


On Fri, May 31, 2019 at 9:27 AM rayman preet <rayman7718@gmail.com> wrote:

> Yes I think your hunch is right. Each container queries the AM over HTTP to
> obtain the jobModel that it is supposed to run. The AM runs an HTTP server,
> usually on a dynamically allocated free port on the machine it's running on.
> So it's possible that a firewall rule blocks the container when it tries to
> reach this port on the AM's machine?
>
> --
> thanks
> rayman
>
> On Thu, May 30, 2019 at 5:30 PM Malcolm McFarland <mmcfarland@cavulus.com>
> wrote:
>
> > Thanks for the image, appreciate you taking the effort to do that! I'm
> > still hitting this wall. The AM will launch the container, the container
> > will go from "accepted" to "running", but there will be no output from
> > the container (I'm piping all of the Samza, org.apache, org.kafka, and our
> > own application's logging output to a Kafka topic). During these periods,
> > the container will hang out at ~100MB/8GB memory usage and stall. There's
> > no error output when this happens; it just kind of stops. My suspicion is
> > that our Ops group has a firewall rule up that's interfering with this, or
> > maybe just isn't white-listing a port correctly, and if I could identify
> > where the application is stalling, it'd probably help to narrow down the
> > possibilities.
> >
> > Cheers,
> > Malcolm McFarland
> > Cavulus
> >
> >
> > On Thu, May 30, 2019 at 1:39 PM rayman preet <rayman7718@gmail.com>
> wrote:
> >
> > > I uploaded the image here:
> > > https://www.dropbox.com/s/rv57v165ysp12c5/samza%20flow.png?dl=0
> > >
> > > Are you still running into this issue?
> > > Is there anything in the container's log that shows any
> > > exceptions/errors?
> > >
> > > On Wed, May 22, 2019 at 10:15 PM Malcolm McFarland <
> > mmcfarland@cavulus.com
> > > >
> > > wrote:
> > >
> > > > Hey rayman,
> > > >
> > > > What it looks like is that the AM has started and the container has
> > > > started, but these are the last messages I see in the Samza logs:
> > > >
> > > > 2019-05-23T05:10:45.048Z        INFO    Making a request for ANY_HOST
> > > > 2019-05-23T05:10:45.057Z        INFO    Starting the container allocator thread
> > > > 2019-05-23T05:10:47.098Z        INFO    Received new token for : <valid_host>:8032
> > > > 2019-05-23T05:10:47.102Z        INFO    Container allocated from RM on <same_valid_host>
> > > > 2019-05-23T05:10:47.105Z        INFO    Container allocated from RM on <same_valid_host>
> > > >
> > > > At this point, it seems to stall, and no more output is produced.
> > > >
> > > > Also, I couldn't see your diagram (it's possible my company's email
> > > > filters attachments); can I see that on the web anywhere?
> > > >
> > > > Cheers,
> > > > Malcolm
> > > >
> > > > On Wed, May 22, 2019 at 4:30 PM rayman preet <rayman7718@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Malcolm,
> > > > >
> > > > > This figure (attached) gives an overview of the flow. Is
> > > > > this something you were looking for?
> > > > >
> > > > > Also, by "don't fully start up" do you mean that
> > > > > applications are missing some containers (but the ApplicationMaster
> > > > > is running)?
> > > > > Or is the application missing entirely?
> > > > >
> > > > > --
> > > > > thanks
> > > > > rayman
> > > > > [image: Samza Job Launch Sequence.png]
> > > > >
> > > > > On Tue, May 21, 2019 at 3:58 PM Malcolm McFarland <
> > > > mmcfarland@cavulus.com>
> > > > > wrote:
> > > > >
> > > > >> Hey Folks,
> > > > >>
> > > > >> I'm still trying to pin down why these applications are sometimes
> > > > >> not starting. Everything looks fine in the YARN web UI and in the
> > > > >> immediately available logs, but the applications don't always
> > > > >> fully start up. Does anybody have a rundown about how to trace the
> > > > >> Samza startup process on a YARN cluster, from Accepted status, to
> > > > >> localization, to the application master startup, to the actual
> > > > >> application's startup?
> > > > >>
> > > > >> Cheers,
> > > > >> Malcolm
> > > > >>
> > > > >> --
> > > > >> Malcolm McFarland
> > > > >> Cavulus
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > thanks
> > > > > rayman
> > > > >
> > > >
> > > >
> > > > --
> > > > Malcolm McFarland
> > > > Cavulus
> > > > 1-800-760-6915
> > > > mmcfarland@cavulus.com
> > > >
> > >
> > >
> > > --
> > > thanks
> > > rayman
> > >
> >
>
>
> --
> thanks
> rayman
>
