spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Spark standalone network configuration problems
Date Fri, 27 Jun 2014 05:54:54 GMT
Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system is not
able to bind to the 192.168.1.101 address. What is the spark:// master URL
that you are seeing in the web UI? (It should be spark://192.168.1.101:5060
in your case, since you set SPARK_MASTER_PORT=5060.)
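If it helps, a quick bind test (a rough sketch in Python; the IP and port mirror the values above, so substitute your own) can distinguish "address already in use" from the host simply not owning that IP:

```python
import socket

# Rough sketch: try to bind the address/port Spark's master would use.
# 192.168.1.101 / 5060 mirror this thread's config -- substitute your own.
def can_bind(ip, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return True, None
    except OSError as e:
        # EADDRINUSE    -> something already listens on that port
        # EADDRNOTAVAIL -> this host does not own that IP at all
        return False, e.errno
    finally:
        s.close()

# Loopback with an ephemeral port should always succeed:
print(can_bind("127.0.0.1", 0))  # (True, None)
```

Running it with the real master IP and port on the master machine tells you which of the two failure modes you're actually hitting.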



Thanks
Best Regards


On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu> wrote:

> In the interest of completeness, this is how I invoke spark:
>
> [on master]
>
> > sbin/start-all.sh
> > spark-submit --py-files extra.py main.py
>
>
> On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master
> (machine1) is sending the command to the worker (machine2) with the
> localhost argument as-is; that is, machine2 isn't doing any weird address
> conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1.
> But I haven't found anything to indicate where the localhost argument could
> be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
> spark-defaults.conf lists spark.master as the full IP address (not
> 127.0.0.1); spark-env.sh on the master also lists the full IP under
> SPARK_MASTER_IP. The *only* place on the master where it's associated with
> localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on master, it's also
> receiving a "spark://localhost:5060" argument, but since it resides on the
> master that works fine. Is it possible that the master is, for some reason,
> passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
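For what it's worth, the behavior is consistent with ordinary name resolution (a rough sketch, not Spark's exact code path): a URL built from a loopback local IP advertises "localhost", which a remote worker then resolves to itself.

```python
import socket

# Rough sketch of why a loopback SPARK_LOCAL_IP poisons the advertised URL:
# "localhost" forward-resolves to 127.0.0.1 on every machine, so a worker
# handed spark://localhost:5060 connects to *itself*, not the master.
print(socket.gethostbyname("localhost"))   # 127.0.0.1
print(socket.getfqdn("127.0.0.1"))         # name varies by /etc/hosts
```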
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however,
> that's when the master crashes immediately due to the address already being
> in use.
>
> Any ideas? Thanks!
>
> Shannon
>
> On 6/26/14, 10:14 AM, Akhil Das wrote:
>
>  Can you paste your spark-env.sh file?
>
>  Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>
>>  Both /etc/hosts have each other's IP addresses in them. Telneting from
>> machine2 to machine1 on port 5060 works just fine.
>>
>> Here's the output of lsof:
>>
>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>> COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>> java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip
>> (LISTEN)
>> java    23985 user   40u  IPv6 11099560      0t0  TCP
>> machine1:sip->machine1:48315 (ESTABLISHED)
>> java    23985 user   52u  IPv6 11100405      0t0  TCP
>> machine1:sip->machine2:54476 (ESTABLISHED)
>> java    24157 user   40u  IPv6 11092413      0t0  TCP
>> machine1:48315->machine1:sip (ESTABLISHED)
>>
>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
>> actually running anything there besides Spark, it just does an s/5060/sip/g.
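(That name substitution is just lsof consulting /etc/services; a one-liner sketch to confirm, assuming a standard Linux services file:)

```python
import socket

# lsof maps port numbers to names via /etc/services; 5060/tcp is registered
# as "sip", so Spark's master port shows up under that name even though no
# SIP service is running. Usually prints "sip" on Linux.
print(socket.getservbyport(5060, "tcp"))
```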
>>
>> Is there something to the fact that every time I comment out
>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due
>> to the "address already being in use"? Or am I barking up the wrong tree on
>> that one?
>>
>> Thanks again for all your help; I hope we can knock this one out.
>>
>> Shannon
>>
>>
>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>
>>  Do you have <ip>            machine1 in your worker's /etc/hosts also?
>> If so try telneting from your machine2 to machine1 on port 5060. Also make
>> sure nothing else is running on port 5060 other than Spark (*lsof
>> -i:5060*)
>>
>>  Thanks
>> Best Regards
>>
>>
>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu> wrote:
>>
>>>  Still running into the same problem. /etc/hosts on the master says
>>>
>>> 127.0.0.1    localhost
>>> <ip>            machine1
>>>
>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>>> other ideas?
>>>
>>>
>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>
>>>  Hi Shannon,
>>>
>>>  It should be a configuration issue; check your /etc/hosts and make
>>> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>>
>>>  Thanks
>>> Best Regards
>>>
>>>
>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu>
>>> wrote:
>>>
>>>>  Hi all,
>>>>
>>>> I have a 2-machine Spark network I've set up: a master and worker on
>>>> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
>>>> everything starts up as it should. I see both workers listed on the UI
>>>> page. The logs of both workers indicate successful registration with the
>>>> Spark master.
>>>>
>>>> The problems begin when I attempt to submit a job: I get an "address
>>>> already in use" exception that crashes the program. It says "Failed to bind
>>>> to " and lists the exact port and address of the master.
>>>>
>>>> At this point, the only items I have set in my spark-env.sh are
>>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>
>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
>>>> master to 127.0.0.1. This allows the master to successfully send out the
>>>> jobs; however, it ends up canceling the stage after running this command
>>>> several times:
>>>>
>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>>> (machine2:53597) with 8 cores
>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
>>>> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
>>>> RAM
>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>>> app-20140625210032-0000/8 is now RUNNING
>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>
>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>>>> which point the program crashes. The worker on machine2 shows similar
>>>> messages in its logs. Here are the last bunch:
>>>>
>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>> app-20140625210032-0000/10 for app_name
>>>> Spark assembly has been built with Hive, including Datanucleus jars on
>>>> classpath
>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
>>>> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>> "app-20140625210032-0000"
>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>
>>>> I highlighted the part that seemed strange to me; that's the master
>>>> port number (I set it to 5060), and yet it's referencing localhost? Is this
>>>> the reason why machine2 apparently can't seem to give a confirmation to the
>>>> master once the job is submitted? (The logs from the worker on the master
>>>> node indicate that it's running just fine)
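(One way to sanity-check that suspicion, as a rough sketch: pull the host out of the executor's launch argument with plain URL parsing, nothing Spark-specific, and note where it points from machine2's perspective.)

```python
from urllib.parse import urlparse

# Rough sketch: parse the driver URL the executor was launched with.
# From machine2's point of view, "localhost" is machine2 itself, not the
# master -- so every executor connects to the wrong host and exits.
url = "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler"
parsed = urlparse(url)
print(parsed.hostname, parsed.port)  # localhost 5060
```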
>>>>
>>>> I appreciate any assistance you can offer!
>>>>
>>>> Regards,
>>>> Shannon Quinn
>>>>
>>>>
>>>
>>>
>>
>>
>
>
