spark-user mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject Re: Spark standalone network configuration problems
Date Thu, 26 Jun 2014 13:05:33 GMT
Still running into the same problem. /etc/hosts on the master says

127.0.0.1    localhost
<ip>            machine1

<ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any 
other ideas?
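
For reference, the matching line in conf/spark-env.sh on the master (with the
address elided the same way) is essentially just:

export SPARK_MASTER_IP=<ip>    # same address as the machine1 entry in /etc/hosts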

On 6/26/14, 3:11 AM, Akhil Das wrote:
> Hi Shannon,
>
> It should be a configuration issue, check in your /etc/hosts and make 
> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu> wrote:
>
>     Hi all,
>
>     I have a two-machine Spark cluster set up: a master and a worker
>     on machine1, and a worker on machine2. When I run
>     'sbin/start-all.sh', everything starts up as it should. I see both
>     workers listed on the UI page. The logs of both workers indicate
>     successful registration with the Spark master.
>
>     The problems begin when I attempt to submit a job: I get an
>     "address already in use" exception that crashes the program. The
>     message says "Failed to bind to" followed by the exact address
>     and port of the master.
>
>     At this point, the only items I have set in my spark-env.sh are
>     SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
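>
>     For reference, those two entries in conf/spark-env.sh amount to
>     something like this (the actual address is elided here as <ip>):
>
>     export SPARK_MASTER_IP=<ip>      # the master's (machine1's) address
>     export SPARK_MASTER_PORT=5060    # the non-standard port mentioned above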
>
>     The next step I took, then, was to explicitly set SPARK_LOCAL_IP
>     on the master to 127.0.0.1 (the one-line change is sketched after
>     the log excerpt below). This allows the master to successfully
>     send out the jobs; however, the stage ends up being canceled after
>     the following sequence repeats several times:
>
>     14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>     app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>     (machine2:53597) with 8 cores
>     14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted
>     executor ID app-20140625210032-0000/8 on hostPort machine2:53597
>     with 8 cores, 8.0 GB RAM
>     14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>     app-20140625210032-0000/8 is now RUNNING
>     14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>     app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
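>
>     (The one-line change described above amounts to this addition in
>     conf/spark-env.sh on the master:)
>
>     export SPARK_LOCAL_IP=127.0.0.1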
>
>     The "/8" counter started at "/1" and worked its way up to "/9"
>     and then "/10", at which point the program crashed. The worker on
>     machine2 shows similar messages in its logs; here are the last few:
>
>     14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>     14/06/25 21:00:31 INFO Worker: Asked to launch executor
>     app-20140625210032-0000/10 for app_name
>     Spark assembly has been built with Hive, including Datanucleus
>     jars on classpath
>     14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java"
>     "-cp"
>     "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>     "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>     "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>     "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>     "10" "machine2" "8"
>     "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>     "app-20140625210032-0000"
>     14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>
>     I highlighted the part that seems strange to me: that is the
>     master port number I set (5060), and yet the host is localhost.
>     Is this the reason why machine2 apparently can't report back to
>     the master once the job is submitted? (The logs from the worker
>     on the master node indicate that it's running just fine.)
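>
>     Spelled out, that launch command points machine2's executor at
>     akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler, i.e.
>     at machine2's own loopback, whereas I would have expected the
>     master's address there, something like
>     akka.tcp://spark@machine1:5060/user/CoarseGrainedScheduler.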
>
>     I appreciate any assistance you can offer!
>
>     Regards,
>     Shannon Quinn
>
>

