Hi,

I'm running Spark 2.0.1 with Spark Launcher 2.0.1 on a YARN cluster. I launch a map task which spawns a Spark job via SparkLauncher#startApplication().

Deploy mode is yarn-client. I'm running on a Mac laptop.

I have this snippet of code:

SparkAppHandle appHandle = sparkLauncher.startApplication();

while (appHandle.getState() == null || !appHandle.getState().isFinal()) {
    if (appHandle.getState() != null) {
        log.info("while: Spark job state is : " + appHandle.getState());
        if (appHandle.getAppId() != null) {
            log.info("\t App id: " + appHandle.getAppId() + "\tState: " + appHandle.getState());
        }
    }
}

The above snippet works fine: both the Spark job and the map task that spawns it complete successfully.

But if I comment out the line highlighted in red, the Spark job still launches and finishes successfully, but the map task hangs for a while (in Running state) and then fails with the exception below.

I run the exact same code in the exact same environment, except with that one line commented out.

When the highlighted line is commented out, I don't even see the 2nd log line in stderr. It seems the appHandle never returns anything (neither app id nor app state), even though the Spark application starts, runs and finishes successfully. In the same stderr I can see the Spark job's logs, the job's printed results, and the application report indicating its status.
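For comparison, the same wait can be written with a pause between state checks, so the loop does not spin at full speed when there is nothing to log. This is only a minimal sketch: the `State` enum, `waitForFinal`, `sleepMillis` and `maxPolls` are hypothetical stand-ins so that it runs without Spark on the classpath; they are not part of the Spark API. Whether this changes the behaviour described above is untested.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PollWithSleep {
    // Hypothetical stand-in for SparkAppHandle.State: only what the loop needs.
    enum State {
        UNKNOWN(false), RUNNING(false), FINISHED(true);

        private final boolean isFinal;

        State(boolean isFinal) { this.isFinal = isFinal; }

        boolean isFinal() { return isFinal; }
    }

    // Polls the supplier until it reports a final state, sleeping between polls.
    // Returns the final state, or the last state seen once maxPolls is exhausted.
    static State waitForFinal(Supplier<State> stateSource, long sleepMillis, int maxPolls)
            throws InterruptedException {
        State state = stateSource.get();
        for (int i = 1; i < maxPolls && (state == null || !state.isFinal()); i++) {
            Thread.sleep(sleepMillis); // yield the CPU instead of busy-waiting
            state = stateSource.get();
        }
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake state source: null first, then RUNNING twice, then FINISHED.
        AtomicInteger calls = new AtomicInteger();
        Supplier<State> fake = () -> {
            int n = calls.getAndIncrement();
            if (n == 0) return null;
            return n < 3 ? State.RUNNING : State.FINISHED;
        };
        State last = waitForFinal(fake, 10, 100);
        System.out.println("final state: " + last + " after " + calls.get() + " polls");
    }
}
```

In the real job, the supplier would presumably be replaced by a call to appHandle.getState(), with the stand-in enum swapped for SparkAppHandle.State.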

You can see the exception below (this is from the stderr of the mapper container which launches Spark job):
---

INFO: Communication exception: java.net.ConnectException: Call From <my-hostname>/10.3.8.118 to <my-hostname>:53567 failed on connection exception: java.net.ConnectException: Connection refused;

Caused by: java.net.ConnectException: Connection refused

        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)

        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)

        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)

        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)

        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)

        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)

        at org.apache.hadoop.ipc.Client.call(Client.java:1451)

        ... 5 more

---

Nov 05, 2016 2:41:54 AM org.apache.hadoop.ipc.Client handleConnectionFailure

INFO: Retrying connect to server: <my-hostname>/10.3.8.118:53567. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Nov 05, 2016 2:41:54 AM org.apache.hadoop.mapred.Task run

INFO: Communication exception: java.net.ConnectException: Call From <my-hostname>/10.3.8.118 to <my-hostname>:53567 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)

        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)

        at org.apache.hadoop.ipc.Client.call(Client.java:1479)

        at org.apache.hadoop.ipc.Client.call(Client.java:1412)

        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)

        at com.sun.proxy.$Proxy9.ping(Unknown Source)

        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:767)

        at java.lang.Thread.run(Thread.java:745)

Caused by: java.net.ConnectException: Connection refused

        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)

        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)

        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)

        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)

        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)

        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)

        at org.apache.hadoop.ipc.Client.call(Client.java:1451)

        ... 5 more

---

Nov 05, 2016 2:41:54 AM org.apache.hadoop.mapred.Task logThreadInfo

INFO: Process Thread Dump: Communication exception

10 active threads

Thread 24 (org.apache.hadoop.hdfs.PeerCache@4763c727):

  State: TIMED_WAITING

  Blocked count: 0

  Waited count: 79

  Stack:

    java.lang.Thread.sleep(Native Method)

    org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:255)

    org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)

    org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)

    java.lang.Thread.run(Thread.java:745)