spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Spark standalone network configuration problems
Date Thu, 26 Jun 2014 13:13:50 GMT
Do you also have the "<ip> machine1" entry in your worker's /etc/hosts? If
so, try telnetting from machine2 to machine1 on port 5060. Also make sure
nothing other than Spark is listening on port 5060 (*lsof -i:5060*)
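
If telnet is not installed, the same two checks can be scripted. The following
is only a rough sketch, not anything Spark ships: the helper names
(parse_hosts, is_loopback, port_is_reachable) are made up for illustration, and
it assumes plain /etc/hosts syntax and a TCP listener on the master port.

```python
import ipaddress
import socket


def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to the first address listed for it."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name, ignore later duplicates.
            mapping.setdefault(name, address)
    return mapping


def is_loopback(address):
    """True if the address is a loopback address (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def port_is_reachable(host, port, timeout=3.0):
    """Attempt a TCP connection, roughly what `telnet host port` tests."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

On machine2, you could feed the contents of /etc/hosts to parse_hosts, confirm
that the address it returns for machine1 is not a loopback address, and then
call port_is_reachable("machine1", 5060) while the master is running.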

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu> wrote:

>  Still running into the same problem. /etc/hosts on the master says
>
> 127.0.0.1    localhost
> <ip>            machine1
>
> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
> other ideas?
>
>
> On 6/26/14, 3:11 AM, Akhil Das wrote:
>
>  Hi Shannon,
>
>  It is most likely a configuration issue. Check your /etc/hosts and make
> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>
>  Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu> wrote:
>
>>  Hi all,
>>
>> I have a 2-machine Spark network I've set up: a master and worker on
>> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
>> everything starts up as it should. I see both workers listed on the UI
>> page. The logs of both workers indicate successful registration with the
>> Spark master.
>>
>> The problems begin when I attempt to submit a job: I get an "address
>> already in use" exception that crashes the program. It says "Failed to bind
>> to " and lists the exact port and address of the master.
>>
>> At this point, the only items I have set in my spark-env.sh are
>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>
>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
>> master to 127.0.0.1. This allows the master to successfully send out the
>> jobs; however, it ends up canceling the stage after running this command
>> several times:
>>
>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>> (machine2:53597) with 8 cores
>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
>> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
>> RAM
>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>> app-20140625210032-0000/8 is now RUNNING
>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>
>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>> which point the program crashes. The worker on machine2 shows similar
>> messages in its logs. Here are the last bunch:
>>
>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>> finished with state FAILED message Command exited with code 1 exitStatus 1
>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>> app-20140625210032-0000/10 for app_name
>> Spark assembly has been built with Hive, including Datanucleus jars on
>> classpath
>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
>> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>> "app-20140625210032-0000"
>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>
>> I highlighted the part that seemed strange to me: that's the master port
>> number (I set it to 5060), and yet it's referencing localhost. Is this the
>> reason why machine2 can't seem to send a confirmation to the master once
>> the job is submitted? (The logs from the worker on the master node
>> indicate that it's running just fine.)
>>
>> I appreciate any assistance you can offer!
>>
>> Regards,
>> Shannon Quinn
>>
>>
>
>
