spark-user mailing list archives

From Ji Yan <ji...@drive.ai>
Subject Re: Spark/Mesos with GPU support
Date Fri, 30 Dec 2016 19:48:28 GMT
Sharing the conversation thread on Spark/Mesos GPU support with a broader
audience.

---------- Forwarded message ----------
From: Ji Yan <jiyan@drive.ai>
Date: Thu, Dec 29, 2016 at 12:57 PM
Subject: Re: Spark/Mesos with GPU support
To: Timothy Chen <tnachen@gmail.com>


Hi Timothy, thanks for the help, it works now. Just to explain what
happened: your hunch was right that the Spark conf was not being passed in
correctly. First I went to the Spark UI as you suggested and did not see the
Mesos settings show up. After that, I changed to passing the parameters to
spark-submit directly on the command line, and the settings showed up and
worked. Thanks again for the help :)

Best,
Ji
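
A minimal sketch of the fix described above, for readers of the archive: it
builds a spark-submit command line that carries each setting as a
`--conf key=value` pair instead of relying on `conf.set(...)` inside the
script. The helper name `build_spark_submit` and its parameters are
hypothetical; the paths, master URL, and setting values are copied from the
quoted messages below, and the exact command Ji ended up running is not shown
in the thread.

```python
import shlex


def build_spark_submit(spark_submit, master, app, conf):
    """Return the argv list for spark-submit with one --conf flag per setting."""
    argv = [spark_submit, "--master", master]
    # Sort keys so the generated command is stable between runs.
    for key, value in sorted(conf.items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    return argv


# Setting values taken from the thread; the helper itself is illustrative only.
settings = {
    "spark.mesos.gpus.max": "1",
    "spark.mesos.executor.docker.image":
        "docker.drive.ai/spark_gpu_experiment:latest",
}

argv = build_spark_submit(
    "spark-2.1.0-bin-spark-2.1-rc5-mesos/bin/spark-submit",
    "mesos://mesos_master_dev:5050",
    "SimpleApp.py",
    settings,
)

# Print a shell-ready command line (shlex.join needs Python 3.8+).
print(shlex.join(argv))
```

Whichever way the settings are passed, the Environment tab of the Spark UI is
where Ji confirmed that they had actually been picked up.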

On Wed, Dec 28, 2016 at 1:49 PM, Ji Yan <jiyan@drive.ai> wrote:

> More logs at GLOG_v=2
>
> I1228 13:46:42.653489  9566 hierarchical.cpp:271] Added framework
>> 993198d1-7393-4656-9f75-4f22702609d0-0014
>> I1228 13:46:42.653728  9566 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:42.653780  9566 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:42.653882  9566 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 349332ns
>> I1228 13:46:43.288697  9545 process.cpp:2677] Resuming (1)@ at 2016-12-28
>> 21:46:43.288666112+00:00
>> I1228 13:46:43.289139  9545 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:43.289204  9545 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:43.289358  9545 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 535266ns
>> I1228 13:46:44.290138  9552 process.cpp:2677] Resuming (1)@ at 2016-12-28
>> 21:46:44.290100992+00:00
>> I1228 13:46:44.290650  9552 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:44.290724  9552 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:44.290863  9552 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 572635ns
>> I1228 13:46:44.634511  9572 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.634493952+00:00
>> I1228 13:46:44.634510  9571 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.634486016+00:00
>> I1228 13:46:44.634603  9572 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master'
>> I1228 13:46:44.634991  9562 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.634964992+00:00
>> I1228 13:46:44.635152  9562 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/index.html' with length 4628
>> I1228 13:46:44.670963  9555 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.670938880+00:00
>> I1228 13:46:44.671020  9557 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.670981888+00:00
>> I1228 13:46:44.671140  9557 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/css/bootstrap-3.3.6.min.css'
>> I1228 13:46:44.671378  9561 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671341824+00:00
>> I1228 13:46:44.671475  9555 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671454976+00:00
>> I1228 13:46:44.671546  9557 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/css/mesos.css'
>> I1228 13:46:44.671617  9555 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/css/bootstrap-3.3.6.min.css'
>> with length 121260
>> I1228 13:46:44.671820  9567 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671799808+00:00
>> I1228 13:46:44.671953  9567 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/css/mesos.css' with length
>> 2714
>> I1228 13:46:44.679502  9566 process.cpp:2677] Resuming __http__(51)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.679481088+00:00
>> I1228 13:46:44.679518  9546 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.679495936+00:00
>> I1228 13:46:44.679622  9546 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/jquery-1.7.1.min.js'
>> I1228 13:46:44.679975  9563 process.cpp:2677] Resuming __http__(51)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.679950080+00:00
>> I1228 13:46:44.680107  9563 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/jquery-1.7.1.min.js' with
>> length 93868
>> I1228 13:46:44.700376  9553 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.700357888+00:00
>> I1228 13:46:44.700428  9541 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.700403968+00:00
>> I1228 13:46:44.700503  9541 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/underscore-1.4.3.min.js'
>> I1228 13:46:44.700806  9568 process.cpp:2677] Resuming __http__(17)@
>> 172.161.152:5050 <http://172.16.1.152:5050/> at 2016-12-28
>> 21:46:44.700773120+00:00
>> I1228 13:46:44.700948  9568 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/underscore-1.4.3.min.js'
>> with length 13432
>> I1228 13:46:44.701045  9552 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.701024000+00:00
>> I1228 13:46:44.701051  9542 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.701029888+00:00
>> I1228 13:46:44.701148  9542 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/zeroclipboard-1.1.7.js'
>> I1228 13:46:44.701447  9569 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.701420032+00:00
>> I1228 13:46:44.701582  9569 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/zeroclipboard-1.1.7.js'
>> with length 16965
>>
>>
>
>
> On Wed, Dec 28, 2016 at 1:20 PM, Ji Yan <jiyan@drive.ai> wrote:
>
>> These are some more logs that came after
>>
>> I1228 13:14:28.959544 9569 hierarchical.cpp:271] Added framework
>>> 993198d1-7393-4656-9f75-4f22702609d0-0007
>>> I1228 13:14:28.959837 9569 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:28.959900 9569 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:28.960018 9569 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 414146ns
>>> I1228 13:14:29.012315 9548 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:29.012393 9548 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:29.012488 9548 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 521825ns
>>> I1228 13:14:30.013526 9567 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:30.013653 9567 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:30.013790 9567 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 647664ns
>>
>>
>> On Wed, Dec 28, 2016 at 12:46 PM, Ji Yan <jiyan@drive.ai> wrote:
>>
>>> I have set this environment variable before restarting mesos master
>>>
>>> export GLOG_v=1
>>>
>>> Ji
>>>
>>> On Wed, Dec 28, 2016 at 12:43 PM, Ji Yan <jiyan@drive.ai> wrote:
>>>
>>>> Thanks Timothy, I've added you on chat. Also these are the logs I get
>>>> from master
>>>>
>>>> I1228 12:40:49.213546 9544 master.cpp:2424] Received SUBSCRIBE call for
>>>>> framework 'SimpleApp' at
>>>>> scheduler-981ef901-4bb1-424d-a46b-520afe191caa@:45540
>>>>> <http://scheduler-981ef901-4bb1-424d-a46b-520afe191caa@172.16.1.101:45540/>
>>>>> I1228 12:40:49.214092 9544 master.cpp:2500] Subscribing framework
>>>>> SimpleApp with checkpointing disabled and capabilities [ ]
>>>>> I1228 12:40:49.217025 9551 hierarchical.cpp:271] Added framework
>>>>> 993198d1-7393-4656-9f75-4f22702609d0-0000
>>>>
>>>>
>>>> Thanks
>>>> Ji
>>>>
>>>> On Wed, Dec 28, 2016 at 12:33 PM, Timothy Chen <tnachen@gmail.com>
>>>>  wrote:
>>>>
>>>>> Btw if it's easier you can add me on Google hangout chat.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Wed, Dec 28, 2016 at 12:19 PM, Timothy Chen <tnachen@gmail.com>
>>>>>  wrote:
>>>>>
>>>>>> Can you enable verbose logging (GLOG_v=1) on the master and look at
>>>>>> what the master log says when the framework is registered?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Wed, Dec 28, 2016 at 12:02 PM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>
>>>>>>> It is empty, no resource has been allocated
>>>>>>>
>>>>>>>
>>>>>>> On Dec 28, 2016, at 11:54 AM, Timothy Chen <tnachen@gmail.com> wrote:
>>>>>>>
>>>>>>> And what is the offer looking like coming from Mesos master?
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:37 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>>
>>>>>>> Thanks Timothy, we had this set in the Spark configuration. The
>>>>>>> cluster has one node with two GPU cores.
>>>>>>>
>>>>>>> conf.set('spark.mesos.gpus.max', '1')
>>>>>>>
>>>>>>> The test app we are trying to run now is very simple
>>>>>>>
>>>>>>>
>>>>>>> from pyspark import SparkContext, SparkConf
>>>>>>> conf = SparkConf()
>>>>>>> conf.set('spark.mesos.executor.docker.image',
>>>>>>>          'docker.drive.ai/spark_gpu_experiment:latest')
>>>>>>> conf.set('spark.mesos.executor.docker.volumes',
>>>>>>>          '/cronut:/cronut:ro,spark-2.1.0-bin-spark-2.1-rc5-mesos:/spark-2.1.0-bin-spark-2.1-rc5-mesos')
>>>>>>> conf.set('spark.mesos.gpus.max', '1')
>>>>>>> sc = SparkContext(conf=conf, appName="SimpleApp")
>>>>>>> sc.setLogLevel('ALL')
>>>>>>> logFile = "/fig/home/jackie/spark_play/test.in"
>>>>>>> logData = sc.textFile(logFile, minPartitions=10).cache()
>>>>>>> logData.count()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> and this is the shell command to launch it
>>>>>>>
>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/fig/home/jackie/spark_play/libmesos.so
>>>>>>> # this is inside the container
>>>>>>> export SPARK_EXECUTOR_URI=/fig/home/jiyan/spark-2.1.0-bin-spark-2.1-rc5-mesos.tgz
>>>>>>> spark-2.1.0-bin-spark-2.1-rc5-mesos/bin/spark-submit \
>>>>>>>  --master 'mesos://mesos_master_dev:5050' \
>>>>>>>  -v SimpleApp.py
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Maybe you would be able to spot something wrong in this setup
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ji
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:25 AM, Timothy Chen <tnachen@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Ji,
>>>>>>>
>>>>>>> Did you specify GPU resources (spark.mesos.gpus.max) when you
>>>>>>> launched Spark? The current design of GPU resources in Mesos is
>>>>>>> that it tries not to offer resources with GPUs to frameworks that
>>>>>>> don't ask for GPU resources.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>>
>>>>>>> Dear Spark Users,
>>>>>>>
>>>>>>> Has anyone had successful experience running Spark on Mesos with GPU
>>>>>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>>>>>> resources. With Spark, it seems that the GPU support with Mesos
>>>>>>> (https://github.com/apache/spark/pull/14644) has only recently been
>>>>>>> merged into Spark Master which is not found in 2.0.2 release yet. We
>>>>>>> have a custom built Spark from 2.1-rc5 which is confirmed to have the
>>>>>>> above change. However when we try to run any code from Spark on this
>>>>>>> Mesos setup, the spark program hangs and keeps saying
>>>>>>>
>>>>>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>>>>>> check your cluster UI to ensure that workers are registered and have
>>>>>>> sufficient resources”
>>>>>>>
>>>>>>> We are pretty sure that the cluster has enough resources as there is
>>>>>>> nothing running on it. If we disable the GPU support in configuration
>>>>>>> and restart mesos and retry the same program, it would work.
>>>>>>>
>>>>>>> Any comment/advice on this is greatly appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ji
>>>>>>>
>>>>>>>
>>>>>>> The information in this email is confidential and may be legally
>>>>>>> privileged. It is intended solely for the addressee. Access to this
>>>>>>> email by anyone else is unauthorized. If you are not the intended
>>>>>>> recipient, any disclosure, copying, distribution or any action taken
>>>>>>> or omitted to be taken in reliance on it, is prohibited and may be
>>>>>>> unlawful.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
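
One detail in the master log quoted above is worth a closer look: the
framework subscribes with `capabilities [ ]`, which fits Tim's point that
Mesos withholds GPU offers from frameworks that do not ask for them. The
following is a rough sketch of checking that line mechanically. It assumes
that a framework which does request GPUs would show `GPU_RESOURCES` inside
the brackets; the thread only shows the empty case, so treat that format as
an assumption. The function name `framework_capabilities` is hypothetical.

```python
import re

# Matches the master's "Subscribing framework" line as it appears in the thread.
SUBSCRIBE_RE = re.compile(
    r"Subscribing framework\s+(?P<name>\S+)\s+with checkpointing \w+ "
    r"and capabilities \[(?P<caps>[^\]]*)\]"
)


def framework_capabilities(log_text):
    """Map each framework name to the capabilities listed in its subscribe line."""
    # Mail quoting wraps log lines and prefixes them with ">"; undo both.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    found = {}
    for match in SUBSCRIBE_RE.finditer(flat):
        caps = [c.strip() for c in match.group("caps").split(",") if c.strip()]
        found[match.group("name")] = caps
    return found


# Log excerpt copied from the quoted message, including its mail quoting.
sample = """\
>>>>> I1228 12:40:49.214092 9544 master.cpp:2500] Subscribing framework
>>>>> SimpleApp with checkpointing disabled and capabilities [ ]
"""

caps = framework_capabilities(sample)
print(caps)
# False here means the framework did not advertise the GPU capability.
print("GPU_RESOURCES" in caps.get("SimpleApp", []))
```

An empty capability list at subscribe time points at the Spark side (the GPU
setting never reached the scheduler) rather than at the Mesos agent, which
matches how the issue was eventually resolved.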

On Fri, Dec 30, 2016 at 11:06 AM, Stephen Boesch <javadba@gmail.com> wrote:

> Would it be possible to share that communication?  I am interested in this
> thread.
>
> 2016-12-30 11:02 GMT-08:00 Ji Yan <jiyan@drive.ai>:
>
>> Thanks Michael, Tim and I have touched base and thankfully the issue has
>> already been resolved
>>
>> On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt <mgummelt@mesosphere.io>
>> wrote:
>>
>>> I've cc'd Tim and Kevin, who worked on GPU support.
>>>
>>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>
>>>> Dear Spark Users,
>>>>
>>>> Has anyone had successful experience running Spark on Mesos with GPU
>>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>>> resources. With Spark, it seems that the GPU support with Mesos (
>>>> https://github.com/apache/spark/pull/14644) has only recently been
>>>> merged into Spark Master which is not found in 2.0.2 release yet. We have
>>>> a custom built Spark from 2.1-rc5 which is confirmed to have the above
>>>> change. However when we try to run any code from Spark on this Mesos setup,
>>>> the spark program hangs and keeps saying
>>>>
>>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>>> check your cluster UI to ensure that workers are registered and have
>>>> sufficient resources”
>>>>
>>>> We are pretty sure that the cluster has enough resources as there is
>>>> nothing running on it. If we disable the GPU support in configuration and
>>>> restart mesos and retry the same program, it would work.
>>>>
>>>> Any comment/advice on this is greatly appreciated.
>>>>
>>>> Thanks,
>>>> Ji
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>>
>
>

