spark-user mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject Re: Spark standalone network configuration problems
Date Fri, 27 Jun 2014 13:34:44 GMT
Sorry, the master Spark URL in the web UI is *spark://192.168.1.101:5060*,
exactly as configured.

On 6/27/14, 9:07 AM, Shannon Quinn wrote:
> I put the settings as you specified in spark-env.sh for the master. 
> When I run start-all.sh, the web UI shows both the worker on the 
> master (machine1) and the slave worker (machine2) as ALIVE and ready, 
> with the master URL at spark://192.168.1.101. However, when I run 
> spark-submit, it immediately crashes with
>
> py4j.protocol.Py4JJavaError
> 14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]
> akka.remote.RemoteTransportException: Startup failed
> [...]
> org.jboss.netty.channel.ChannelException: Failed to bind to 
> /192.168.1.101:5060
> [...]
> java.net.BindException: Address already in use.
> [...]
>
> This seems entirely contrary to intuition; why would Spark be unable 
> to bind to the exact IP:port set for the master?
>
> On 6/27/14, 1:54 AM, Akhil Das wrote:
>> Hi Shannon,
>>
>> How about a setting like the following? (just removed the quotes)
>>
>> export SPARK_MASTER_IP=192.168.1.101
>> export SPARK_MASTER_PORT=5060
>> #export SPARK_LOCAL_IP=127.0.0.1
>>
>> Not sure what's happening in your case; it could be that your system
>> is not able to bind to the 192.168.1.101 address. What is the spark://
>> master url that you are seeing there in the webUI? (It should be 
>> spark://192.168.1.101:7077 in your case).
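>>
>> Just a thought, and the exact commands may differ on your setup, but a
>> quick way to check whether the address/port itself is the issue is to
>> confirm the interface actually carries that IP and see what is already
>> listening on the port:
>>
>> ip addr show | grep 192.168.1.101
>> ss -tlnp | grep 5060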
>>
>>
>>
>> Thanks
>> Best Regards
>>
>>
>> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu> wrote:
>>
>>     In the interest of completeness, this is how I invoke Spark:
>>
>>     [on master]
>>
>>     > sbin/start-all.sh
>>     > spark-submit --py-files extra.py main.py
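>>
>>     (For completeness: the master URL can also be pinned on the command
>>     line, e.g.
>>
>>     > spark-submit --master spark://192.168.1.101:5060 --py-files extra.py main.py
>>
>>     which is the same URL I have in spark-defaults.conf.)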
>>
>>     iPhone'd
>>
>>     On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu> wrote:
>>
>>>     My *best guess* (please correct me if I'm wrong) is that the
>>>     master (machine1) is sending the command to the worker
>>>     (machine2) with the localhost argument as-is; that is, machine2
>>>     isn't doing any weird address conversion on its end.
>>>
>>>     Consequently, I've been focusing on the settings of the
>>>     master/machine1. But I haven't found anything to indicate where
>>>     the localhost argument could be coming from. /etc/hosts lists
>>>     only 127.0.0.1 as localhost; spark-defaults.conf lists
>>>     spark.master as the full IP address (not 127.0.0.1);
>>>     spark-env.sh on the master also lists the full IP under
>>>     SPARK_MASTER_IP. The *only* place on the master where it's
>>>     associated with localhost is SPARK_LOCAL_IP.
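>>>
>>>     Concretely, the relevant lines are roughly (values as described
>>>     above, master port 5060):
>>>
>>>     /etc/hosts:            127.0.0.1    localhost   (the only localhost entry)
>>>     spark-defaults.conf:   spark.master    spark://192.168.1.101:5060
>>>     spark-env.sh:          SPARK_MASTER_IP=192.168.1.101
>>>                            SPARK_LOCAL_IP=127.0.0.1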
>>>
>>>     In looking at the logs of the worker spawned on master, it's
>>>     also receiving a "spark://localhost:5060" argument, but since it
>>>     resides on the master, that works fine. Is it possible that the
>>>     master is, for some reason, passing
>>>     "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>>>
>>>     That was my motivation behind commenting out SPARK_LOCAL_IP;
>>>     however, that's when the master crashes immediately due to the
>>>     address already being in use.
>>>
>>>     Any ideas? Thanks!
>>>
>>>     Shannon
>>>
>>>     On 6/26/14, 10:14 AM, Akhil Das wrote:
>>>>     Can you paste your spark-env.sh file?
>>>>
>>>>     Thanks
>>>>     Best Regards
>>>>
>>>>
>>>>     On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>>>
>>>>         Both /etc/hosts have each other's IP addresses in them.
>>>>         Telneting from machine2 to machine1 on port 5060 works just
>>>>         fine.
>>>>
>>>>         Here's the output of lsof:
>>>>
>>>>         user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>>>         COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>>>>         java    23985 user   30u  IPv6 11092354    0t0  TCP
>>>>         machine1:sip (LISTEN)
>>>>         java    23985 user   40u  IPv6 11099560    0t0  TCP
>>>>         machine1:sip->machine1:48315 (ESTABLISHED)
>>>>         java    23985 user   52u  IPv6 11100405    0t0  TCP
>>>>         machine1:sip->machine2:54476 (ESTABLISHED)
>>>>         java    24157 user   40u  IPv6 11092413    0t0  TCP
>>>>         machine1:48315->machine1:sip (ESTABLISHED)
>>>>
>>>>         Ubuntu seems to recognize 5060 as the standard port for
>>>>         "sip"; nothing besides Spark is actually running there,
>>>>         lsof just does an s/5060/sip/g in its output.
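>>>>
>>>>         (If I remember the lsof flags right, "lsof -nP -i:5060" keeps
>>>>         the port numeric instead of substituting "sip".)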
>>>>
>>>>         Is there something to the fact that every time I comment
>>>>         out SPARK_LOCAL_IP in spark-env, it crashes immediately
>>>>         upon spark-submit due to the "address already being in
>>>>         use"? Or am I barking up the wrong tree on that one?
>>>>
>>>>         Thanks again for all your help; I hope we can knock this
>>>>         one out.
>>>>
>>>>         Shannon
>>>>
>>>>
>>>>         On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>>>         Do you have <ip> machine1 in your worker's /etc/hosts
>>>>>         also? If so, try telnetting from your machine2 to
>>>>>         machine1 on port 5060. Also make sure nothing else is
>>>>>         running on port 5060 other than Spark (*lsof -i:5060*).
>>>>>
>>>>>         Thanks
>>>>>         Best Regards
>>>>>
>>>>>
>>>>>         On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>>>>
>>>>>             Still running into the same problem. /etc/hosts on the
>>>>>             master says
>>>>>
>>>>>             127.0.0.1    localhost
>>>>>             <ip> machine1
>>>>>
>>>>>             <ip> is the same address set in spark-env.sh for
>>>>>             SPARK_MASTER_IP. Any other ideas?
>>>>>
>>>>>
>>>>>             On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>>             Hi Shannon,
>>>>>>
>>>>>>             It should be a configuration issue, check in your
>>>>>>             /etc/hosts and make sure localhost is not associated
>>>>>>             with the SPARK_MASTER_IP you provided.
>>>>>>
>>>>>>             Thanks
>>>>>>             Best Regards
>>>>>>
>>>>>>
>>>>>>             On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu> wrote:
>>>>>>
>>>>>>                 Hi all,
>>>>>>
>>>>>>                 I have a 2-machine Spark network I've set up: a
>>>>>>                 master and a worker on machine1, and a worker on
>>>>>>                 machine2. When I run 'sbin/start-all.sh',
>>>>>>                 everything starts up as it should. I see both
>>>>>>                 workers listed on the UI page. The logs of both
>>>>>>                 workers indicate successful registration with the
>>>>>>                 Spark master.
>>>>>>
>>>>>>                 The problems begin when I attempt to submit a
>>>>>>                 job: I get an "address already in use" exception
>>>>>>                 that crashes the program. It says "Failed to bind
>>>>>>                 to " and lists the exact port and address of the
>>>>>>                 master.
>>>>>>
>>>>>>                 At this point, the only items I have set in my
>>>>>>                 spark-env.sh are SPARK_MASTER_IP and
>>>>>>                 SPARK_MASTER_PORT (non-standard, set to 5060).
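>>>>>>
>>>>>>                 Concretely, spark-env.sh is roughly just:
>>>>>>
>>>>>>                 export SPARK_MASTER_IP="192.168.1.101"
>>>>>>                 export SPARK_MASTER_PORT="5060"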
>>>>>>
>>>>>>                 The next step I took, then, was to explicitly set
>>>>>>                 SPARK_LOCAL_IP on the master to 127.0.0.1. This
>>>>>>                 allows the master to successfully send out the
>>>>>>                 jobs; however, it ends up canceling the stage
>>>>>>                 after running this command several times:
>>>>>>
>>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>>                 Executor added: app-20140625210032-0000/8 on
>>>>>>                 worker-20140625205623-machine2-53597
>>>>>>                 (machine2:53597) with 8 cores
>>>>>>                 14/06/25 21:00:47 INFO
>>>>>>                 SparkDeploySchedulerBackend: Granted executor ID
>>>>>>                 app-20140625210032-0000/8 on hostPort
>>>>>>                 machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>>                 Executor updated: app-20140625210032-0000/8 is
>>>>>>                 now RUNNING
>>>>>>                 14/06/25 21:00:49 INFO AppClient$ClientActor:
>>>>>>                 Executor updated: app-20140625210032-0000/8 is
>>>>>>                 now FAILED (Command exited with code 1)
>>>>>>
>>>>>>                 The "/8" started at "/1", eventually becomes
>>>>>>                 "/9", and then "/10", at which point the program
>>>>>>                 crashes. The worker on machine2 shows similar
>>>>>>                 messages in its logs. Here are the last bunch:
>>>>>>
>>>>>>                 14/06/25 21:00:31 INFO Worker: Executor
>>>>>>                 app-20140625210032-0000/9 finished with state
>>>>>>                 FAILED message Command exited with code 1
>>>>>>                 exitStatus 1
>>>>>>                 14/06/25 21:00:31 INFO Worker: Asked to launch
>>>>>>                 executor app-20140625210032-0000/10 for app_name
>>>>>>                 Spark assembly has been built with Hive,
>>>>>>                 including Datanucleus jars on classpath
>>>>>>                 14/06/25 21:00:32 INFO ExecutorRunner: Launch
>>>>>>                 command: "java" "-cp"
>>>>>>                 "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>>                 "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>>                 "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>>                 "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>>>>                 "10" "machine2" "8"
>>>>>>                 "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>>>>                 "app-20140625210032-0000"
>>>>>>                 14/06/25 21:00:33 INFO Worker: Executor
>>>>>>                 app-20140625210032-0000/10 finished with state
>>>>>>                 FAILED message Command exited with code 1
>>>>>>                 exitStatus 1
>>>>>>
>>>>>>                 I highlighted the part that seemed strange to me;
>>>>>>                 that's the master port number (I set it to 5060),
>>>>>>                 and yet it's referencing localhost? Is this the
>>>>>>                 reason why machine2 apparently can't seem to give
>>>>>>                 a confirmation to the master once the job is
>>>>>>                 submitted? (The logs from the worker on the
>>>>>>                 master node indicate that it's running just fine)
>>>>>>
>>>>>>                 I appreciate any assistance you can offer!
>>>>>>
>>>>>>                 Regards,
>>>>>>                 Shannon Quinn
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

