samza-dev mailing list archives

From Andrew Sannier <ASann...@helixeducation.com>
Subject Re: Samza closing and re-opening kafka connection rapidly, cannot consume or produce, no useful logs
Date Tue, 31 Mar 2015 21:59:55 GMT
Thanks so much for getting back to me, Chris.

I’ve attached the AM log from my most recent attempt to run the
hello-samza wikipedia-feed task. I’ve been using pretty small nodes to
keep costs down while I test and so forth, so that makes a lot of sense
(though I definitely hoped I’d configured appropriate memory ceilings).
Here are the values from the YARN UI:

Containers Running: 1
Memory Used: 1 GB
Memory Total: 1.76 GB
Memory Reserved: 0 B
VCores Used: 1
VCores Total: 8
VCores Reserved: 0
Active Nodes: 1
Decommissioned Nodes: 0
Lost Nodes: 0
Unhealthy Nodes: 0
Rebooted Nodes: 0
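
As a rough sanity check on these numbers, here is a minimal Python sketch comparing the free memory against the 1700 MB container request that appears in the application-master log quoted below. It assumes the UI's GB figures are binary (1 GB = 1024 MB); the helper names are hypothetical and not part of YARN or Samza. The real scheduler also rounds requests up to its minimum allocation and places containers per node, so treat this as an approximation only.

```python
# Rough check of whether a requested container fits in the memory YARN has left.
# Inputs: the YARN UI values above (1 GB used, 1.76 GB total, 0 B reserved) and
# the "Requesting 1 container(s) with 1700mb of memory" line from the AM log.
# Assumption: the UI's GB values are binary (1 GB = 1024 MB).

MB_PER_GB = 1024


def free_memory_mb(total_gb: float, used_gb: float, reserved_gb: float = 0.0) -> float:
    """Memory (in MB) that is neither used nor reserved on the cluster."""
    return (total_gb - used_gb - reserved_gb) * MB_PER_GB


def container_fits(requested_mb: float, total_gb: float, used_gb: float,
                   reserved_gb: float = 0.0) -> bool:
    """True if the requested container is no larger than the free memory."""
    return requested_mb <= free_memory_mb(total_gb, used_gb, reserved_gb)


if __name__ == "__main__":
    free = free_memory_mb(total_gb=1.76, used_gb=1.0)
    print(f"free memory: {free:.0f} MB")
    print("1700 MB container fits:",
          container_fits(1700, total_gb=1.76, used_gb=1.0))
```

With the AM already holding 1 GB, well under 1700 MB remains, which would be consistent with the request never being satisfied and the Running Containers table staying empty.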


Again, much obliged for your response.

Andrew Sannier



On 3/31/15, 3:54 PM, "Chris Riccomini" <criccomini@apache.org> wrote:

>Hey Andrew,
>
>I'm wondering if your YARN cluster doesn't have enough memory to fit both
>the AM and its containers. The fact that the AM UI shows no running
>containers is suspicious. Can you check these six settings in your YARN
>RM's UI:
>
>  Memory Used
>  Memory Total
>  Memory Reserved
>  VCores Used
>  VCores Total
>  VCores Reserved
>
>Can you also attach (or post to gist/pastebin/etc) the YARN AM's full log?
>
>Cheers,
>Chris
>
>On Tue, Mar 31, 2015 at 2:32 PM, Andrew Sannier
><ASannier@helixeducation.com> wrote:
>
>> Something to add here: there are a couple of weird things in the Samza
>> Application Master web UI: Application master task ID is -1, which seems
>> odd, and the Running Containers table is completely empty. How could YARN
>> call a task “Running” if there’s no container?
>>
>> Thanks,
>> Andrew Sannier
>>
>>
>>
>>
>>
>> On 3/31/15, 2:19 PM, "Andrew Sannier" <ASannier@helixeducation.com> wrote:
>>
>> >Hi all -
>> >
>> >Thanks in advance for your help; I have been totally stuck on this for a
>> >couple of days.
>> >
>> >I have a small YARN cluster with one ResourceManager and one NodeManager
>> >as well as one Zookeeper node and one Kafka node - trying to keep the
>> >number of moving parts to a minimum. I've been following the guide to
>> >running Samza on YARN
>> >(https://samza.apache.org/learn/tutorials/0.8/run-in-multi-node-yarn.html),
>> >and I get to the end of the tutorial with a Running job in the YARN web
>> >UI, as expected. However, the job doesn't actually appear to do anything -
>> >messages are not produced to the "wikipedia-raw" topic (nor is the topic
>> >created), and no data is logged at all.
>> >
>> >To that point, I am having a ton of trouble with Samza's logging - in
>> >samza.log.dir on the ResourceManager node, there's only gc.log.0.current,
>> >and in the YARN log directory I have only the resourcemanager log, which of
>> >course contains no application information. On the NodeManager side,
>> >samza.log.dir contains application-manager.log, which ends at "[INFO]
>> >Requesting 1 container(s) with 1700mb of memory" right after the job
>> >enters the Running state, its own copy of gc.log.0.current, and stderr
>> >and stdout, which contain no useful information and also don't grow after
>> >the first second of the job running. In YARN's logs, there's only the node
>> >manager log, which has no errors or warnings and just logs the startup of
>> >the container and then its memory usage from then on, which seems fine:
>> >
>> >2015-03-31 20:17:34,635 INFO  [Container Monitor]
>> >monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) -
>> >Memory usage of ProcessTree 25767 for container-id
>> >container_1427823389325_0011_01_000001: 104.9 MB of 1 GB physical memory
>> >used; 2.4 GB of 3.1 GB virtual memory used
>> >
>> >
>> >What am I missing here? WikipediaFeed.java contains a whole bunch of
>> >logging statements, but nothing ever hits any file I can find. Even if you
>> >can't help with the problem I'm having with hello-samza, I would greatly
>> >appreciate any advice on how I can get useful logs from Samza jobs.
>> >
>> >I've checked that I can ping the Wikipedia IRC URL and consume
>> >from/produce to the Kafka cluster with the console shell scripts from both
>> >the ResourceManager and NodeManager nodes, and other applications can work
>> >with my Kafka and Zookeeper with no issues. From the application-master
>> >log on the worker node, all I can see is that Samza configures the
>> >Wikipedia IRC system, starts the Webapp, and requests a container. It
>> >enters the Running state with YARN, after which point nothing happens at
>> >all. There's no activity at all in the Kafka or Zookeeper logs.
>> >
>> >And that's it; the job will run for hours if I let it but at no point is
>> >anything produced to Kafka or logged at all. I wrote a simpler task that
>> >just accepts a json message from a topic on Kafka, adds a timestamp, and
>> >produces to another topic, but almost nothing is different. From
>> >application-master log:
>> >
>> >2015-03-31 20:07:05 ClientUtils$ [INFO] Fetching metadata from broker
>> >id:0,host:172.31.2.19,port:9092 with correlation id 0 for 1 topic(s)
>> >Set(test)
>> >2015-03-31 20:07:05 SyncProducer [INFO] Connected to 172.31.2.19:9092 for
>> >producing
>> >2015-03-31 20:07:05 SyncProducer [INFO] Disconnecting from
>> >172.31.2.19:9092
>> >2015-03-31 20:07:06 KafkaSystemAdmin$ [INFO] Got metadata: Map(test ->
>> >SystemStreamMetadata [streamName=test, partitionMetadata={Partition
>> >[partition=0]=SystemStreamPartitionMetadata [oldestOffset=0,
>> >newestOffset=4, upcomingOffset=5], Partition
>> >[partition=1]=SystemStreamPartitionMetadata [oldestOffset=null,
>> >newestOffset=null, upcomingOffset=0]}])
>> >
>> >
>> >which all looks correct. Then it connects to ResourceManager, starts the
>> >Webapp, requests a container and starts running. All I see in Kafka's log
>> >is
>> >
>> >[2015-03-31 20:07:05,999] INFO Closing socket connection to /172.31.1.229.
>> >(kafka.network.Processor)
>> >[2015-03-31 20:07:06,090] INFO Closing socket connection to /172.31.1.229.
>> >(kafka.network.Processor)
>> >
>> >
>> >and Zookeeper has nothing to say at all. As before, no new topic is
>> >created.
>> >
>> >So a huge part of this question is just, what am I missing about logging?
>> >Where are the actual job/task-level logs? Aside from that, I just have no
>> >explanation for why nothing is happening in either of these simple tasks.
>> >I would really appreciate any insight anyone can offer...
>> >
>> >Oh, one more thing - there was an error message in Zookeeper after
>> >submitting my simple StreamTask that I haven't been able to reproduce:
>> >
>> >2015-03-31 19:48:28,145 [myid:] - INFO
>> >[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
>> >Accepted socket connection from /172.31.2.19:41801
>> >2015-03-31 19:48:28,147 [myid:] - WARN
>> >[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] -
>> >Connection request from old client /172.31.2.19:41801; will be dropped if
>> >server is in r-o mode
>> >2015-03-31 19:48:28,148 [myid:] - INFO
>> >[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client
>> >attempting to establish new session at /172.31.2.19:41801
>> >2015-03-31 19:48:28,149 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@617]
>> >- Established session 0x14c70bd0c3e0006 with negotiated timeout 30000 for
>> >client /172.31.2.19:41801
>> >2015-03-31 19:48:28,202 [myid:] - INFO  [ProcessThread(sid:0
>> >cport:-1)::PrepRequestProcessor@494] - Processed session termination for
>> >sessionid: 0x14c70bd0c3e0006
>> >2015-03-31 19:48:28,206 [myid:] - INFO
>> >[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed
>> >socket connection for client /172.31.2.19:41801 which had sessionid
>> >0x14c70bd0c3e0006
>> >
>> >
>> >172.31.2.19 is the Kafka broker. The job continued unfazed; Samza didn't
>> >log anything about this socket being closed or any kind of error. Not sure
>> >if that's related.
>> >
>> >
>> >Again, thanks a ton for reading and whatever help you can offer.
>> >
>> >Andrew Sannier
>> >Software Engineer, Big Data
>> >C: 480-284-1048
>> >www.helixeducation.com
>> >
>> >
>>
>>
