From Shannon Quinn <squ...@gatech.edu>
Subject Re: Spark standalone network configuration problems
Date Thu, 26 Jun 2014 14:19:41 GMT
export SPARK_MASTER_IP="192.168.1.101"
export SPARK_MASTER_PORT="5060"
export SPARK_LOCAL_IP="127.0.0.1"

That's it. If I comment out SPARK_LOCAL_IP, or set it to the same 
address as SPARK_MASTER_IP, that's when spark-submit throws the 
"address already in use" error. If I leave it as the localhost IP, 
that's when I get the communication errors with machine2 that 
ultimately lead to the job failure.
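
For what it's worth, my reading of what each of those lines controls 
(the comments are my interpretation, so take them with a grain of salt):

export SPARK_MASTER_IP="192.168.1.101"  # address the master binds to and advertises
export SPARK_MASTER_PORT="5060"         # master port (the default is 7077)
export SPARK_LOCAL_IP="127.0.0.1"       # address this host binds for its own daemons and the driver

And a quick way to double-check which address each Spark JVM has 
actually bound (plain lsof, nothing Spark-specific; -n and -P keep 
hosts and ports numeric):

sudo lsof -nP -iTCP -sTCP:LISTEN | grep java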

Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:
> Can you paste your spark-env.sh file?
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>
>     Both machines' /etc/hosts files have each other's IP addresses in
>     them. Telnetting from machine2 to machine1 on port 5060 works just fine.
>
>     Here's the output of lsof:
>
>     user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>     COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>     java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip
>     (LISTEN)
>     java    23985 user   40u  IPv6 11099560      0t0  TCP
>     machine1:sip->machine1:48315 (ESTABLISHED)
>     java    23985 user   52u  IPv6 11100405      0t0  TCP
>     machine1:sip->machine2:54476 (ESTABLISHED)
>     java    24157 user   40u  IPv6 11092413      0t0  TCP
>     machine1:48315->machine1:sip (ESTABLISHED)
>
>     Ubuntu recognizes 5060 as the standard port for "sip"; nothing
>     besides Spark is actually running there, lsof just substitutes the
>     service name for the port number in its output.
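>
>     (Side note: lsof does that mapping via /etc/services; running it
>     as "lsof -nP -i:5060" keeps the output numeric if the "sip" label
>     is confusing.)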
>
>     Is there something to the fact that every time I comment out
>     SPARK_LOCAL_IP in spark-env.sh, it crashes immediately upon
>     spark-submit with the "address already in use" error? Or am I
>     barking up the wrong tree on that one?
>
>     Thanks again for all your help; I hope we can knock this one out.
>
>     Shannon
>
>
>     On 6/26/14, 9:13 AM, Akhil Das wrote:
>>     Do you have "<ip>    machine1" in your worker's /etc/hosts
>>     also? If so, try telnetting from machine2 to machine1 on port
>>     5060. Also make sure nothing else is running on port 5060 other
>>     than Spark (lsof -i:5060).
>>
>>     Thanks
>>     Best Regards
>>
>>
>>     On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>
>>         Still running into the same problem. /etc/hosts on the master
>>         says
>>
>>         127.0.0.1    localhost
>>         <ip>            machine1
>>
>>         <ip> is the same address set in spark-env.sh for
>>         SPARK_MASTER_IP. Any other ideas?
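>>
>>         For concreteness, the layout I'm describing, with placeholder
>>         addresses standing in for the real ones:
>>
>>         # /etc/hosts on both machine1 and machine2
>>         127.0.0.1       localhost
>>         192.168.1.101   machine1
>>         192.168.1.102   machine2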
>>
>>
>>         On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>         Hi Shannon,
>>>
>>>         This looks like a configuration issue. Check your /etc/hosts
>>>         and make sure localhost is not associated with the
>>>         SPARK_MASTER_IP you provided.
>>>
>>>         Thanks
>>>         Best Regards
>>>
>>>
>>>         On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>         <squinn@gatech.edu> wrote:
>>>
>>>             Hi all,
>>>
>>>             I've set up a two-machine Spark cluster: a master and a
>>>             worker on machine1, and a worker on machine2. When I run
>>>             'sbin/start-all.sh', everything starts up as it should.
>>>             I see both workers listed on the UI page, and the logs
>>>             of both workers indicate successful registration with
>>>             the Spark master.
>>>
>>>             The problems begin when I attempt to submit a job: I get
>>>             an "address already in use" exception that crashes the
>>>             program. It says "Failed to bind to" followed by the
>>>             exact address and port of the master.
>>>
>>>             At this point, the only items I have set in my
>>>             spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT
>>>             (non-standard, set to 5060).
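>>>
>>>             (So the master URL that the workers register with, and
>>>             that I point spark-submit at, would be
>>>             spark://machine1:5060, e.g.:
>>>
>>>             ./bin/spark-submit --master spark://machine1:5060 \
>>>                 --class MyApp myapp.jar
>>>
>>>             where MyApp and myapp.jar stand in for my actual job.)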
>>>
>>>             The next step I took, then, was to explicitly set
>>>             SPARK_LOCAL_IP on the master to 127.0.0.1. This allows
>>>             the master to send out the jobs successfully; however,
>>>             the stage ends up being cancelled after the following
>>>             sequence repeats several times:
>>>
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             added: app-20140625210032-0000/8 on
>>>             worker-20140625205623-machine2-53597 (machine2:53597)
>>>             with 8 cores
>>>             14/06/25 21:00:47 INFO SparkDeploySchedulerBackend:
>>>             Granted executor ID app-20140625210032-0000/8 on
>>>             hostPort machine2:53597 with 8 cores, 8.0 GB RAM
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now RUNNING
>>>             14/06/25 21:00:49 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now FAILED
>>>             (Command exited with code 1)
>>>
>>>             The "/8" started at "/1", eventually becomes "/9", and
>>>             then "/10", at which point the program crashes. The
>>>             worker on machine2 shows similar messages in its logs.
>>>             Here are the last bunch:
>>>
>>>             14/06/25 21:00:31 INFO Worker: Executor
>>>             app-20140625210032-0000/9 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>             14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>             app-20140625210032-0000/10 for app_name
>>>             Spark assembly has been built with Hive, including
>>>             Datanucleus jars on classpath
>>>             14/06/25 21:00:32 INFO ExecutorRunner: Launch command:
>>>             "java" "-cp"
>>>             "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>             "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>             "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>             "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>             "10" "machine2" "8"
>>>             "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>             "app-20140625210032-0000"
>>>             14/06/25 21:00:33 INFO Worker: Executor
>>>             app-20140625210032-0000/10 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>
>>>             I highlighted (with asterisks) the part that seems
>>>             strange to me: it uses the master port number I set
>>>             (5060), and yet it references localhost. Is this why
>>>             machine2 apparently can't report back to the master once
>>>             the job is submitted? (The logs from the worker on the
>>>             master node indicate that it's running just fine.)
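>>>
>>>             (My working theory: whatever SPARK_LOCAL_IP is set to
>>>             becomes the host in that akka.tcp:// driver URL, so with
>>>             127.0.0.1 the executors on machine2 are being told to
>>>             dial back to "localhost", i.e. to machine2 itself. If
>>>             that's right, launching the driver with a routable
>>>             address, for example
>>>
>>>             SPARK_LOCAL_IP=<machine1's LAN address> ./bin/spark-submit ...
>>>
>>>             should make the advertised address reachable from
>>>             machine2. That's a guess on my part, though.)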
>>>
>>>             I appreciate any assistance you can offer!
>>>
>>>             Regards,
>>>             Shannon Quinn
>>>
>>>
>>
>>
>
>

