samza-dev mailing list archives

From rayman preet <rayman7...@gmail.com>
Subject Re: Tracing the Samza+YARN startup process
Date Fri, 31 May 2019 22:07:46 GMT
Apart from /etc/hosts and /bin/hostname, the only other relevant place might
be /etc/resolv.conf: you could modify its values to point to, e.g., a dnsmasq
instance.
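
To see what the JVM itself resolves, independent of what /bin/hostname prints,
a small check like the one below could be run inside one of the containers.
This is only a sketch: HostnameCheck is a made-up class name, not part of
Samza, and it uses nothing beyond the standard java.net.InetAddress API.
InetAddress.getLocalHost() retrieves the host name from the system and then
resolves that name, so the second step is where /etc/hosts and
/etc/resolv.conf come into play.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic helper; not part of Samza or YARN.
public class HostnameCheck {

    // Reports the name and address the JVM associates with the local host.
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return local.getHostName() + " -> " + local.getHostAddress();
        } catch (UnknownHostException e) {
            return "unresolvable: " + e.getMessage();
        }
    }

    // Reports every address the JVM resolves for the given host name.
    static String describe(String host) {
        try {
            StringBuilder sb = new StringBuilder(host).append(" ->");
            for (InetAddress address : InetAddress.getAllByName(host)) {
                sb.append(' ').append(address.getHostAddress());
            }
            return sb.toString();
        } catch (UnknownHostException e) {
            return host + " -> unresolvable";
        }
    }

    public static void main(String[] args) {
        System.out.println("local: " + describeLocalHost());
        // Any extra arguments (e.g. the AM's host name) are resolved as well.
        for (String host : args) {
            System.out.println(describe(host));
        }
    }
}
```

Running it on a container that stalls, with the AM's host name as an argument,
should show whether the container ID hostname resolves at all and whether the
AM's name resolves to an address reachable from that host.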

On Fri, May 31, 2019 at 2:43 PM Malcolm McFarland <mmcfarland@cavulus.com>
wrote:

> Hey Rayman,
>
> The ops group and I went through the configuration today and observed the
> YARN containers as they were coming up. We seem to have found the root of
> the problem, and I'm putting this out there for anybody else that's trying
> to do something similar on AWS ECS:
>
> The ECS container instances set their hostname to the container ID on
> startup (e.g. 717b6f75aaf8), and this looks like it's interfering with the
> YARN container startup process. This *seems* to be corroborated in that
> containers that start on the same host as their AM look to be starting fine
> (i.e. they can locally resolve their IP address correctly), but containers
> starting on other hosts don't seem to be. We were *not* having this problem
> on Fargate, and my only guess is that, given Fargate's intended use case as
> a replicated-services-in-the-cloud environment, AWS sets the hostname for
> Fargate-bound Docker containers on launch (e.g.
> ip-10-#-#-#.us-west-#.internal.local or whatever). (As a side note, we
> probably would have stuck with Fargate and not run into this problem, but
> Fargate instances are only allowed 10GB of disk space, and this wasn't
> enough for YARN's VM requirements.)
>
> I've been fishing around for a way to get Samza to resolve the hostname to
> something more publicly available. I've thus far tried a) changing the
> /etc/hosts file, and b) replacing the /bin/hostname binary in the container
> with a static script, but neither of these options seems to have an effect
> on Java's DNS resolution. Two further options I can think of are:
>
> - find some place in the Samza configuration where the hostname can be set
> explicitly; or
> - change just the right piece of information in the system so that
> java.net.InetAddress will resolve the localhost to something other than
> what's returned from /bin/hostname (I'm guessing it uses gethostname() on
> Ubuntu, could be wrong).
>
> Any ideas, anybody?
>
> Cheers,
> Malcolm McFarland
> Cavulus
>
>
> This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> unauthorized or improper disclosure, copying, distribution, or use of the
> contents of this message is prohibited. The information contained in this
> message is intended only for the personal and confidential use of the
> recipient(s) named above. If you have received this message in error,
> please notify the sender immediately and delete the original message.
>
>
> On Fri, May 31, 2019 at 9:27 AM rayman preet <rayman7718@gmail.com> wrote:
>
> > Yes, I think your hunch is right. Each container queries the AM over HTTP
> > to obtain the jobModel that it is supposed to run. The AM runs an HTTP
> > server, usually on a dynamically allocated free port on the machine it's
> > running on. So it's possible that a firewall rule blocks the container
> > when it tries to reach this port on the AM's machine?
> >
> > --
> > thanks
> > rayman
> >
> > On Thu, May 30, 2019 at 5:30 PM Malcolm McFarland <mmcfarland@cavulus.com> wrote:
> >
> > > Thanks for the image, appreciate you taking the effort to do that! I'm
> > > still hitting this wall. The AM will launch the container, the container
> > > will go from "accepted" to "running", but there will be no output from
> > > the container (I'm piping all of the Samza, org.apache, org.kafka, and
> > > our own application's logging output to a Kafka topic). During these
> > > periods, the container will hang out at ~100MB/8GB memory usage and
> > > stall. There's no error output when this happens; it just kind of stops.
> > > My suspicion is that our Ops group has a firewall rule up that's
> > > interfering with this, or maybe just isn't white-listing a port
> > > correctly, and if I could identify where the application is stalling,
> > > it'd probably help to narrow down the possibilities.
> > >
> > > Cheers,
> > > Malcolm McFarland
> > > Cavulus
> > >
> > >
> > > On Thu, May 30, 2019 at 1:39 PM rayman preet <rayman7718@gmail.com> wrote:
> > >
> > > > I uploaded the image here:
> > > > https://www.dropbox.com/s/rv57v165ysp12c5/samza%20flow.png?dl=0
> > > >
> > > > Are you still running into this issue?
> > > > Is there anything in the container's log that shows any
> > > > exceptions/errors?
> > > >
> > > > On Wed, May 22, 2019 at 10:15 PM Malcolm McFarland <mmcfarland@cavulus.com> wrote:
> > > >
> > > > > Hey rayman,
> > > > >
> > > > > What it looks like is that the AM has started and the container has
> > > > > started, but these are the last messages I see in the Samza logs:
> > > > >
> > > > > 2019-05-23T05:10:45.048Z        INFO    Making a request for ANY_HOST
> > > > > 2019-05-23T05:10:45.057Z        INFO    Starting the container allocator thread
> > > > > 2019-05-23T05:10:47.098Z        INFO    Received new token for : <valid_host>:8032
> > > > > 2019-05-23T05:10:47.102Z        INFO    Container allocated from RM on <same_valid_host>
> > > > > 2019-05-23T05:10:47.105Z        INFO    Container allocated from RM on <same_valid_host>
> > > > >
> > > > > At this point, it seems to stall, and no more output is produced.
> > > > >
> > > > > Also, I couldn't see your diagram (it's possible my company's email
> > > > > filters attachments); can I see that on the web anywhere?
> > > > >
> > > > > Cheers,
> > > > > Malcolm
> > > > >
> > > > > On Wed, May 22, 2019 at 4:30 PM rayman preet <rayman7718@gmail.com> wrote:
> > > > >
> > > > > > Hi Malcolm,
> > > > > >
> > > > > > This figure (attached) gives an overview of the flow. Is
> > > > > > this something you were looking for?
> > > > > >
> > > > > > Also, by "don't fully start up" do you mean that
> > > > > > applications are missing some containers (but the ApplicationMaster
> > > > > > is running)? Or is the application missing entirely?
> > > > > >
> > > > > > --
> > > > > > thanks
> > > > > > rayman
> > > > > > [image: Samza Job Launch Sequence.png]
> > > > > >
> > > > > > On Tue, May 21, 2019 at 3:58 PM Malcolm McFarland <mmcfarland@cavulus.com> wrote:
> > > > > >
> > > > > >> Hey Folks,
> > > > > >>
> > > > > >> I'm still trying to pin down why these applications are sometimes
> > > > > >> not starting. Everything looks fine in the YARN web UI and in the
> > > > > >> immediately available logs, but the applications don't always
> > > > > >> fully start up. Does anybody have a rundown about how to trace the
> > > > > >> Samza startup process on a YARN cluster, from Accepted status, to
> > > > > >> localization, to the application master startup, to the actual
> > > > > >> application's startup?
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Malcolm
> > > > > >>
> > > > > >> --
> > > > > >> Malcolm McFarland
> > > > > >> Cavulus
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > thanks
> > > > > > rayman
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Malcolm McFarland
> > > > > Cavulus
> > > > > 1-800-760-6915
> > > > > mmcfarland@cavulus.com
> > > > >
> > > > >
> > > > > This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> > > > > unauthorized or improper disclosure, copying, distribution, or use
> of
> > > the
> > > > > contents of this message is prohibited. The information contained
> in
> > > this
> > > > > message is intended only for the personal and confidential use of
> the
> > > > > recipient(s) named above. If you have received this message in
> error,
> > > > > please notify the sender immediately and delete the original
> message.
> > > > >
> > > >
> > > >
> > > > --
> > > > thanks
> > > > rayman
> > > >
> > >
> >
> >
> > --
> > thanks
> > rayman
> >
>


-- 
thanks
rayman
