spark-user mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject Re: Spark standalone network configuration problems
Date Fri, 27 Jun 2014 13:07:48 GMT
I put the settings as you specified in spark-env.sh for the master. When 
I run start-all.sh, the web UI shows both the worker on the master 
(machine1) and the slave worker (machine2) as ALIVE and ready, with the 
master URL at spark://192.168.1.101. However, when I run spark-submit, 
it immediately crashes with

py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]
akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to 
/192.168.1.101:5060
[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable to 
bind to the exact IP:port set for the master?
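
A quick way to see what is already holding that port before submitting
(a diagnostic sketch; assumes lsof and iproute2 are installed):

    # any process already listening on the master port?
    lsof -i:5060
    # or, with the owning process shown and numeric output:
    ss -ltnp 'sport = :5060'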

On 6/27/14, 1:54 AM, Akhil Das wrote:
> Hi Shannon,
>
> How about a setting like the following? (just removed the quotes)
>
> export SPARK_MASTER_IP=192.168.1.101
> export SPARK_MASTER_PORT=5060
> #export SPARK_LOCAL_IP=127.0.0.1
>
> Not sure what's happening in your case; it could be that your system is 
> not able to bind to the 192.168.1.101 address. What is the spark:// master 
> URL that you are seeing in the web UI? (It should be 
> spark://192.168.1.101:5060 in your case, given SPARK_MASTER_PORT=5060.)
>
>
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu> wrote:
>
>     In the interest of completeness, this is how I invoke Spark:
>
>     [on master]
>
>     > sbin/start-all.sh
>     > spark-submit --py-files extra.py main.py
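>
>     (A sketch, to rule out config-file precedence, using the master
>     address from this thread: the same invocation with the master URL
>     pinned explicitly on the command line.)
>
>     > spark-submit --master spark://192.168.1.101:5060 --py-files extra.py main.py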
>
>     iPhone'd
>
>     On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu> wrote:
>
>>     My *best guess* (please correct me if I'm wrong) is that the
>>     master (machine1) is sending the command to the worker (machine2)
>>     with the localhost argument as-is; that is, machine2 isn't doing
>>     any weird address conversion on its end.
>>
>>     Consequently, I've been focusing on the settings of the
>>     master/machine1. But I haven't found anything to indicate where
>>     the localhost argument could be coming from. /etc/hosts lists
>>     only 127.0.0.1 as localhost; spark-defaults.conf lists
>>     spark.master as the full IP address (not 127.0.0.1); spark-env.sh
>>     on the master also lists the full IP under SPARK_MASTER_IP. The
>>     *only* place on the master where it's associated with localhost
>>     is SPARK_LOCAL_IP.
>>
>>     In looking at the logs of the worker spawned on master, it's also
>>     receiving a "spark://localhost:5060" argument, but since it
>>     resides on the master that works fine. Is it possible that the
>>     master is, for some reason, passing
>>     "spark://{SPARK_LOCAL_IP}:5060" to the workers?
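>>
>>     One way to check that directly (a sketch; the logs path is an
>>     assumption, adjust to your install) is to grep machine2's worker
>>     log for the launch command it hands to executors:
>>
>>         # on machine2: which scheduler URL are executors given?
>>         grep CoarseGrainedScheduler $SPARK_HOME/logs/*Worker*.out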
>>
>>     That was my motivation behind commenting out SPARK_LOCAL_IP;
>>     however, that's when the master crashes immediately due to the
>>     address already being in use.
>>
>>     Any ideas? Thanks!
>>
>>     Shannon
>>
>>     On 6/26/14, 10:14 AM, Akhil Das wrote:
>>>     Can you paste your spark-env.sh file?
>>>
>>>     Thanks
>>>     Best Regards
>>>
>>>
>>>     On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>>
>>>         Both /etc/hosts have each other's IP addresses in them.
>>>         Telneting from machine2 to machine1 on port 5060 works just
>>>         fine.
>>>
>>>         Here's the output of lsof:
>>>
>>>         user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>>         COMMAND   PID  USER   FD  TYPE  DEVICE   SIZE/OFF NODE NAME
>>>         java    23985 user   30u  IPv6 11092354  0t0  TCP machine1:sip (LISTEN)
>>>         java    23985 user   40u  IPv6 11099560  0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
>>>         java    23985 user   52u  IPv6 11100405  0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
>>>         java    24157 user   40u  IPv6 11092413  0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>>         Ubuntu's /etc/services maps port 5060 to the service name
>>>         "sip", so lsof prints that name; nothing besides Spark is
>>>         actually running there, it's just a s/5060/sip/g in the output.
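>>>
>>>         Side note: lsof can print the raw numbers instead; -P turns
>>>         off the port-to-service-name mapping and -n the hostname
>>>         lookups:
>>>
>>>         lsof -nP -i:5060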
>>>
>>>         Is there something to the fact that every time I comment out
>>>         SPARK_LOCAL_IP in spark-env, it crashes immediately upon
>>>         spark-submit due to the "address already being in use"? Or
>>>         am I barking up the wrong tree on that one?
>>>
>>>         Thanks again for all your help; I hope we can knock this one
>>>         out.
>>>
>>>         Shannon
>>>
>>>
>>>         On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>>         Do you have "<ip> machine1" in your worker's
>>>>         /etc/hosts also? If so, try telneting from machine2 to
>>>>         machine1 on port 5060. Also make sure nothing other than
>>>>         Spark is running on port 5060 (lsof -i:5060).
>>>>
>>>>         Thanks
>>>>         Best Regards
>>>>
>>>>
>>>>         On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>>>
>>>>             Still running into the same problem. /etc/hosts on the
>>>>             master says
>>>>
>>>>             127.0.0.1    localhost
>>>>             <ip> machine1
>>>>
>>>>             <ip> is the same address set in spark-env.sh for
>>>>             SPARK_MASTER_IP. Any other ideas?
>>>>
>>>>
>>>>             On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>             Hi Shannon,
>>>>>
>>>>>             It should be a configuration issue, check in your
>>>>>             /etc/hosts and make sure localhost is not associated
>>>>>             with the SPARK_MASTER_IP you provided.
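>>>>>
>>>>>             For illustration (hostname and address taken from this
>>>>>             thread), the usual Ubuntu pitfall is a 127.0.1.1 line:
>>>>>
>>>>>             # problematic: the host's own name resolves to loopback
>>>>>             127.0.0.1      localhost
>>>>>             127.0.1.1      machine1
>>>>>
>>>>>             # preferred: map the hostname to the routable address
>>>>>             127.0.0.1      localhost
>>>>>             192.168.1.101  machine1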
>>>>>
>>>>>             Thanks
>>>>>             Best Regards
>>>>>
>>>>>
>>>>>             On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>>>             <squinn@gatech.edu> wrote:
>>>>>
>>>>>                 Hi all,
>>>>>
>>>>>                 I have a 2-machine Spark network I've set up: a
>>>>>                 master and a worker on machine1, and a worker on
>>>>>                 machine2. When I run 'sbin/start-all.sh',
>>>>>                 everything starts up as it should. I see both
>>>>>                 workers listed on the UI page. The logs of both
>>>>>                 workers indicate successful registration with the
>>>>>                 Spark master.
>>>>>
>>>>>                 The problems begin when I attempt to submit a job:
>>>>>                 I get an "address already in use" exception that
>>>>>                 crashes the program. It says "Failed to bind to "
>>>>>                 and lists the exact port and address of the master.
>>>>>
>>>>>                 At this point, the only items I have set in my
>>>>>                 spark-env.sh are SPARK_MASTER_IP and
>>>>>                 SPARK_MASTER_PORT (non-standard, set to 5060).
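>>>>>
>>>>>                 Concretely, that spark-env.sh amounts to (a sketch;
>>>>>                 the address is the one used elsewhere in this
>>>>>                 thread):
>>>>>
>>>>>                 export SPARK_MASTER_IP=192.168.1.101
>>>>>                 export SPARK_MASTER_PORT=5060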
>>>>>
>>>>>                 The next step I took, then, was to explicitly set
>>>>>                 SPARK_LOCAL_IP on the master to 127.0.0.1. This
>>>>>                 allows the master to successfully send out the
>>>>>                 jobs; however, it ends up canceling the stage
>>>>>                 after running this command several times:
>>>>>
>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>                 Executor added: app-20140625210032-0000/8 on
>>>>>                 worker-20140625205623-machine2-53597
>>>>>                 (machine2:53597) with 8 cores
>>>>>                 14/06/25 21:00:47 INFO
>>>>>                 SparkDeploySchedulerBackend: Granted executor ID
>>>>>                 app-20140625210032-0000/8 on hostPort
>>>>>                 machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>                 Executor updated: app-20140625210032-0000/8 is now
>>>>>                 RUNNING
>>>>>                 14/06/25 21:00:49 INFO AppClient$ClientActor:
>>>>>                 Executor updated: app-20140625210032-0000/8 is now
>>>>>                 FAILED (Command exited with code 1)
>>>>>
>>>>>                 The "/8" started at "/1", eventually becomes "/9",
>>>>>                 and then "/10", at which point the program
>>>>>                 crashes. The worker on machine2 shows similar
>>>>>                 messages in its logs. Here are the last bunch:
>>>>>
>>>>>                 14/06/25 21:00:31 INFO Worker: Executor
>>>>>                 app-20140625210032-0000/9 finished with state
>>>>>                 FAILED message Command exited with code 1 exitStatus 1
>>>>>                 14/06/25 21:00:31 INFO Worker: Asked to launch
>>>>>                 executor app-20140625210032-0000/10 for app_name
>>>>>                 Spark assembly has been built with Hive, including
>>>>>                 Datanucleus jars on classpath
>>>>>                 14/06/25 21:00:32 INFO ExecutorRunner: Launch
>>>>>                 command: "java" "-cp"
>>>>>                 "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>                 "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>                 "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>                 "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>>>                 "10" "machine2" "8"
>>>>>                 "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
>>>>>                 14/06/25 21:00:33 INFO Worker: Executor
>>>>>                 app-20140625210032-0000/10 finished with state
>>>>>                 FAILED message Command exited with code 1 exitStatus 1
>>>>>
>>>>>                 I highlighted the part that seemed strange to me;
>>>>>                 that's the master port number (I set it to 5060),
>>>>>                 and yet it's referencing localhost? Is this the
>>>>>                 reason why machine2 apparently can't seem to give
>>>>>                 a confirmation to the master once the job is
>>>>>                 submitted? (The logs from the worker on the master
>>>>>                 node indicate that it's running just fine)
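>>>>>
>>>>>                 If the executors really are being handed a localhost
>>>>>                 driver address, one possible workaround (sketched
>>>>>                 here with the address used in this thread) is to pin
>>>>>                 the driver's advertised host in
>>>>>                 conf/spark-defaults.conf on the submitting machine;
>>>>>                 spark.driver.host is a standard Spark property:
>>>>>
>>>>>                 # advertise a routable address to executors, instead
>>>>>                 # of whatever the local hostname resolves to
>>>>>                 spark.driver.host    192.168.1.101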
>>>>>
>>>>>                 I appreciate any assistance you can offer!
>>>>>
>>>>>                 Regards,
>>>>>                 Shannon Quinn
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

