samza-dev mailing list archives

From: Chris Riccomini <criccom...@linkedin.com>
Subject: Re: Running Samza on multi node
Date: Thu, 07 Nov 2013 15:48:33 GMT

Hey Nirmal,

Thanks for this detailed report! It makes things much easier to figure out. The problem appears
to be that the Samza AM is trying to connect to 0.0.0.0:8030 when it talks to the RM.
Port 8030 is the RM's scheduler port, which is listening on 192.168.145.37 (the RM host), not
192.168.145.43 (the NM host). The connection is refused because nothing is listening on 8030
locally on the NM's box, which is where the Samza AM is running.
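
One way to check which address is actually reachable from the NM box is to probe the port directly. This is only a minimal sketch, assuming Python is available on that host; `can_connect` is a hypothetical helper written for this thread, not part of YARN or Samza:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("192.168.145.37", 8030)` and `can_connect("0.0.0.0", 8030)` from the NM host should show whether the RM's scheduler port is reachable remotely and whether anything is listening on it locally.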

It is somewhat interesting that the NM does connect to the RM for the capacity scheduler.
Rather than setting each individual host/port pair, as you've done, I recommend just setting:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.145.37</value>
  </property>
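
My understanding is that in YARN 2.2.0 the individual RM addresses default to this hostname plus a fixed port, and that the hostname itself defaults to 0.0.0.0. The sketch below only illustrates that substitution; the port table is copied from the yarn-site.xml quoted in this thread, and `derive_rm_addresses` is a hypothetical name, not a YARN API:

```python
# Port for each RM service, matching the values in the yarn-site.xml
# quoted in this thread.
RM_SERVICE_PORTS = {
    "yarn.resourcemanager.scheduler.address": 8030,
    "yarn.resourcemanager.resource-tracker.address": 8031,
    "yarn.resourcemanager.address": 8032,
    "yarn.resourcemanager.admin.address": 8033,
    "yarn.resourcemanager.webapp.address": 8088,
}


def derive_rm_addresses(hostname="0.0.0.0"):
    """Illustrative only: expand one RM hostname into host:port pairs."""
    return {name: f"{hostname}:{port}" for name, port in RM_SERVICE_PORTS.items()}
```

With the default hostname this yields 0.0.0.0:8030 for the scheduler address, which is the address in the stack trace; with 192.168.145.37 every service points at the RM host.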

Your netstat reports look fine – as expected.

Other questions:

1. Both your NM and RM are running YARN 2.2.0, right?
2. It appears that your AM shuts down. Did you run kill-job.sh to kill it?

Regarding (2), it appears that the AM never tries to register, which it normally does. I'm
wondering if another failure is being triggered, which is then causing the AM to try to shut
itself down. Could you turn on debug logging for your Samza job (in log4j.xml) and re-run? I'm
curious whether the web service that's starting up or the registration itself is failing. In a
normal execution, you would expect to see:


    info("Got AM register response. The YARN RM supports container requests with max-mem: %s, max-cpu: %s" format (maxMem, maxCpu))

I don't see this in your logs, which means the AM is failing (and triggering a shutdown) before
it even tries to register.
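
If it helps when re-running, a quick way to check an AM log for that message is sketched below. It assumes the log is plain text with one entry per line; `am_registered` is a hypothetical helper, and the marker string is taken from the log statement quoted above:

```python
# Fixed prefix of the registration message quoted above.
REGISTER_MARKER = "Got AM register response"


def am_registered(log_lines):
    """Return True if any log line contains the AM registration message."""
    return any(REGISTER_MARKER in line for line in log_lines)
```

An open file object can be passed directly, since iterating over it yields lines.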

Cheers,
Chris

From: Nirmal Kumar <nirmal.kumar@impetus.co.in>
Reply-To: "dev@samza.incubator.apache.org" <dev@samza.incubator.apache.org>
Date: Thursday, November 7, 2013 5:05 AM
To: "dev@samza.incubator.apache.org" <dev@samza.incubator.apache.org>
Subject: Running Samza on multi node

All,

I was able to run the hello-samza application on a single node machine.
Now I am trying to run the hello-samza application on a 2-node setup.

Node1 has a Resource Manager
Node2 has a Node Manager

The NM gets registered with the RM successfully as seen in rm.log of the RM node:
13/11/07 11:44:29 INFO service.AbstractService: Service:ResourceManager is started.
13/11/07 11:48:30 INFO util.RackResolver: Resolved IMPETUS-DSRV14.impetus.co.in to /default-rack
13/11/07 11:48:30 INFO resourcemanager.ResourceTrackerService: NodeManager from node IMPETUS-DSRV14.impetus.co.in(cmPort:
56093 httpPort: 8042) registered with capability: <memory:8192, vCores:16>, assigned
nodeId IMPETUS-DSRV14.impetus.co.in:56093
13/11/07 11:48:30 INFO rmnode.RMNodeImpl: IMPETUS-DSRV14.impetus.co.in:56093 Node Transitioned
from NEW to RUNNING
13/11/07 11:48:30 INFO capacity.CapacityScheduler: Added node IMPETUS-DSRV14.impetus.co.in:56093
clusterResource: <memory:8192, vCores:16>

I am submitting the job from the RM machine using the command line:
bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory
--config-path=file:/home/bda/nirmal/hello-samza/deploy/samza/config/test-consumer.properties

However, I am getting the following exception after submitting the job to YARN:

2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got container id: container_1383816757258_0001_01_000001
2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got app attempt id: appattempt_1383816757258_0001_000001
2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got node manager host: IMPETUS-DSRV14
2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got node manager port: 59828
2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got node manager http port: 8042
2013-11-07 15:05:57 SamzaAppMaster$ [INFO] got config: {task.inputs=kafka.storm-sentence,
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory, systems.kafka.samza.consumer.factory=samza.stream.kafka.KafkaConsumerFactory,
job.name=test-Consumer, systems.kafka.consumer.zookeeper.connect=192.168.145.195:2181/, systems.kafka.consumer.auto.offset.reset=largest,
systems.kafka.samza.msg.serde=json, serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory,
systems.kafka.samza.partition.manager=samza.stream.kafka.KafkaPartitionManager, task.window.ms=10000,
task.class=samza.examples.wikipedia.task.TestConsumer, yarn.package.path=file:/home/temptest/samza+storm/hello-samza/samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz,
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory, systems.kafka.producer.metadata.broker.list=192.168.145.195:9092,192.168.145.195:9093}
2013-11-07 15:05:57 ClientHelper [INFO] trying to connect to RM /0.0.0.0:8032
2013-11-07 15:05:57 JmxServer [INFO] According to InetAddress.getLocalHost.getHostName we
are IMPETUS-DSRV14.impetus.co.in
2013-11-07 15:05:57 JmxServer [INFO] Started JmxServer port=47115 url=service:jmx:rmi:///jndi/rmi://IMPETUS-DSRV14.impetus.co.in:47115/jmxrmi
2013-11-07 15:05:57 SamzaAppMasterTaskManager [INFO] No yarn.container.count specified. Defaulting
to one container.
2013-11-07 15:05:57 VerifiableProperties [INFO] Verifying properties
2013-11-07 15:05:57 VerifiableProperties [INFO] Property client.id is overridden to samza_admin-test_Consumer-1-1383816957797-0
2013-11-07 15:05:57 VerifiableProperties [INFO] Property metadata.broker.list is overridden
to 192.168.145.195:9092,192.168.145.195:9093
2013-11-07 15:05:57 VerifiableProperties [INFO] Verifying properties
2013-11-07 15:05:57 VerifiableProperties [INFO] Property auto.offset.reset is overridden to
largest
2013-11-07 15:05:57 VerifiableProperties [INFO] Property client.id is overridden to samza_admin-test_Consumer-1-1383816957797-0
2013-11-07 15:05:57 VerifiableProperties [INFO] Property group.id is overridden to undefined-samza-consumer-group-
2013-11-07 15:05:57 VerifiableProperties [INFO] Property zookeeper.connect is overridden to
192.168.145.195:2181/
2013-11-07 15:05:57 VerifiableProperties [INFO] Verifying properties
2013-11-07 15:05:57 VerifiableProperties [INFO] Property client.id is overridden to samza_admin-test_Consumer-1-1383816957797-0
2013-11-07 15:05:57 VerifiableProperties [INFO] Property metadata.broker.list is overridden
to 192.168.145.195:9092,192.168.145.195:9093
2013-11-07 15:05:57 VerifiableProperties [INFO] Property request.timeout.ms is overridden
to 6000
2013-11-07 15:05:57 ClientUtils$ [INFO] Fetching metadata from broker id:0,host:192.168.145.195,port:9092
with correlation id 0 for 1 topic(s) Set(storm-sentence)
2013-11-07 15:05:57 SyncProducer [INFO] Connected to 192.168.145.195:9092 for producing
2013-11-07 15:05:57 SyncProducer [INFO] Disconnecting from 192.168.145.195:9092
2013-11-07 15:05:57 SamzaAppMasterService [INFO] Starting webapp at rpc 39152, tracking port
26751
2013-11-07 15:05:57 log [INFO] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.eclipse.jetty.util.log)
via org.eclipse.jetty.util.log.Slf4jLog
2013-11-07 15:05:58 ClientHelper [INFO] trying to connect to RM /0.0.0.0:8032
2013-11-07 15:05:58 log [INFO] jetty-7.0.0.v20091005
2013-11-07 15:05:58 log [INFO] Extract jar:file:/tmp/hadoop-vuser/nm-local-dir/usercache/bda/appcache/application_1383816757258_0001/filecache/8004956396276725272/samza-job-package-0.7.0-dist.tar.gz/lib/samza-yarn_2.8.1-0.7.0-yarn-2.0.5-alpha.jar!/scalate/WEB-INF/
to /tmp/Jetty_0_0_0_0_39152_scalate____xveaws/webinf/WEB-INF
2013-11-07 15:05:58 ServletTemplateEngine [INFO] Scalate template engine using working directory:
/tmp/scalate-5279562760844696556-workdir
2013-11-07 15:05:58 log [INFO] Started SelectChannelConnector@0.0.0.0:39152
2013-11-07 15:05:58 log [INFO] jetty-7.0.0.v20091005
2013-11-07 15:05:58 log [INFO] Extract jar:file:/tmp/hadoop-vuser/nm-local-dir/usercache/bda/appcache/application_1383816757258_0001/filecache/8004956396276725272/samza-job-package-0.7.0-dist.tar.gz/lib/samza-yarn_2.8.1-0.7.0-yarn-2.0.5-alpha.jar!/scalate/WEB-INF/
to /tmp/Jetty_0_0_0_0_26751_scalate____.dr19qj/webinf/WEB-INF
2013-11-07 15:05:58 ServletTemplateEngine [INFO] Scalate template engine using working directory:
/tmp/scalate-5582747144249485577-workdir
2013-11-07 15:05:58 log [INFO] Started SelectChannelConnector@0.0.0.0:26751
2013-11-07 15:06:08 SamzaAppMasterLifecycle [INFO] Shutting down.
2013-11-07 15:06:18 YarnAppMaster [WARN] Listener org.apache.samza.job.yarn.SamzaAppMasterLifecycle@500c954e
failed to shutdown.
java.lang.reflect.UndeclaredThrowableException
         at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
         at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:90)
         at org.apache.hadoop.yarn.client.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:244)
         at org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onShutdown(SamzaAppMasterLifecycle.scala:68)
         at org.apache.samza.job.yarn.YarnAppMaster$$anonfun$run$9.apply(YarnAppMaster.scala:70)
         at org.apache.samza.job.yarn.YarnAppMaster$$anonfun$run$9.apply(YarnAppMaster.scala:69)
         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
         at scala.collection.immutable.List.foreach(List.scala:45)
         at org.apache.samza.job.yarn.YarnAppMaster.run(YarnAppMaster.scala:69)
         at org.apache.samza.job.yarn.SamzaAppMaster$.main(SamzaAppMaster.scala:78)
         at org.apache.samza.job.yarn.SamzaAppMaster.main(SamzaAppMaster.scala)
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From IMPETUS-DSRV14.impetus.co.in/192.168.145.43
to 0.0.0.0:8030 failed on connection exception: java.net.ConnectException: Connection refused;
For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
         at $Proxy12.finishApplicationMaster(Unknown Source)
         at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:87)
         ... 9 more
Caused by: java.net.ConnectException: Call From IMPETUS-DSRV14.impetus.co.in/192.168.145.43
to 0.0.0.0:8030 failed on connection exception: java.net.ConnectException: Connection refused;
For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:780)
         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:727)
         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
         ... 11 more
Caused by: java.net.ConnectException: Connection refused
         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:526)
         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:490)
         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:508)
         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:603)
         at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:253)
         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1288)
         at org.apache.hadoop.ipc.Client.call(Client.java:1206)
         ... 12 more


I have changed the following properties in the hello-samza/deploy/yarn/etc/hadoop/yarn-site.xml
on the Node Manager machine:

<property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>192.168.145.37:8030</value>
</property>
<property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>192.168.145.37:8031</value>
</property>
<property>
                <name>yarn.resourcemanager.address</name>
                <value>192.168.145.37:8032</value>
</property>
<property>
                <name>yarn.resourcemanager.admin.address</name>
                <value>192.168.145.37:8033</value>
</property>
<property>
                <name>yarn.resourcemanager.webapp.address</name>
                <value>192.168.145.37:8088</value>
</property>



These properties are reflected on the UI screen as well:

[Image: UI screenshot showing these properties; not preserved in the archive]

But this overriding of the yarn.resourcemanager.scheduler.address to 192.168.145.37:8030 does
not rectify the error.
I still get:
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From IMPETUS-DSRV14.impetus.co.in/192.168.145.43
to 0.0.0.0:8030 failed on connection exception: java.net.ConnectException: Connection refused;
For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Netstat on the RM machine shows me:
tcp        0      0 ::ffff:192.168.145.37:8088  :::*                        LISTEN      14595/java
tcp        0      0 ::ffff:192.168.145.37:8030  :::*                        LISTEN      14595/java
tcp        0      0 ::ffff:192.168.145.37:8031  :::*                        LISTEN      14595/java
tcp        0      0 ::ffff:192.168.145.37:8032  :::*                        LISTEN      14595/java
tcp        0      0 ::ffff:192.168.145.37:8033  :::*                        LISTEN      14595/java

Netstat on the NM machine shows me:
tcp        0      0 :::8040                     :::*                        LISTEN      1331/java
tcp        0      0 :::8042                     :::*                        LISTEN      1331/java
tcp        0      0 :::56877                    :::*                        LISTEN      1331/java

Kindly help me rectify this error.

Regards,
-Nirmal

