Sharing the conversation thread on Spark/Mesos GPU support with a broader
audience.
---------- Forwarded message ----------
From: Ji Yan <jiyan@drive.ai>
Date: Thu, Dec 29, 2016 at 12:57 PM
Subject: Re: Spark/Mesos with GPU support
To: Timothy Chen <tnachen@gmail.com>
Hi Timothy, thanks for the help. It works now. Just to explain what
happened: your hunch was right that the Spark conf was not being passed in
correctly. First I went to the Spark UI as you suggested and did not see the
Mesos settings show up. After that, I changed to passing the parameters
directly to spark-submit on the command line, and the settings showed up and
worked. Thanks again for the help :)
Best,
Ji
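For anyone who hits the same symptom, the fix described above (passing the Mesos settings on the spark-submit command line instead of relying on the application's SparkConf) can be sketched roughly as follows. This is a minimal sketch, not the exact command that was run: the helper name build_spark_submit_args is hypothetical, and the master URL, property and script name are simply the ones quoted further down in this thread. Only --master and --conf KEY=VALUE are actual spark-submit options.

```python
import shlex
from typing import Dict, List


def build_spark_submit_args(master: str, conf: Dict[str, str], app: str) -> List[str]:
    """Build a spark-submit argument list with each setting passed as --conf.

    Hypothetical helper for illustration; it only assembles the arguments
    and does not launch anything.
    """
    args = ["spark-submit", "--master", master]
    # Sort the keys so the generated command line is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Values taken from the test setup quoted later in this thread.
    args = build_spark_submit_args(
        "mesos://mesos_master_dev:5050",
        {"spark.mesos.gpus.max": "1"},
        "SimpleApp.py",
    )
    # Print a shell-safe command line that can be inspected before running it.
    print(" ".join(shlex.quote(a) for a in args))
```

After submitting this way, the Environment tab of the Spark UI is a quick place to confirm that the settings actually reached the driver.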
On Wed, Dec 28, 2016 at 1:49 PM, Ji Yan <jiyan@drive.ai> wrote:
> More logs at GLOG_v=2
>
> I1228 13:46:42.653489 9566 hierarchical.cpp:271] Added framework
>> 993198d1-7393-4656-9f75-4f22702609d0-0014
>> I1228 13:46:42.653728 9566 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:42.653780 9566 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:42.653882 9566 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 349332ns
>> I1228 13:46:43.288697 9545 process.cpp:2677] Resuming (1)@ at 2016-12-28
>> 21:46:43.288666112+00:00
>> I1228 13:46:43.289139 9545 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:43.289204 9545 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:43.289358 9545 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 535266ns
>> I1228 13:46:44.290138 9552 process.cpp:2677] Resuming (1)@ at 2016-12-28
>> 21:46:44.290100992+00:00
>> I1228 13:46:44.290650 9552 hierarchical.cpp:1537] No allocations
>> performed
>> I1228 13:46:44.290724 9552 hierarchical.cpp:1632] No inverse offers to
>> send out!
>> I1228 13:46:44.290863 9552 hierarchical.cpp:1172] Performed allocation
>> for 1 agents in 572635ns
>> I1228 13:46:44.634511 9572 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.634493952+00:00
>> I1228 13:46:44.634510 9571 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.634486016+00:00
>> I1228 13:46:44.634603 9572 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master'
>> I1228 13:46:44.634991 9562 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.634964992+00:00
>> I1228 13:46:44.635152 9562 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/index.html' with length 4628
>> I1228 13:46:44.670963 9555 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.670938880+00:00
>> I1228 13:46:44.671020 9557 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.670981888+00:00
>> I1228 13:46:44.671140 9557 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/css/bootstrap-3.3.6.min.css'
>> I1228 13:46:44.671378 9561 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671341824+00:00
>> I1228 13:46:44.671475 9555 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671454976+00:00
>> I1228 13:46:44.671546 9557 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/css/mesos.css'
>> I1228 13:46:44.671617 9555 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/css/bootstrap-3.3.6.min.css'
>> with length 121260
>> I1228 13:46:44.671820 9567 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.671799808+00:00
>> I1228 13:46:44.671953 9567 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/css/mesos.css' with length
>> 2714
>> I1228 13:46:44.679502 9566 process.cpp:2677] Resuming __http__(51)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.679481088+00:00
>> I1228 13:46:44.679518 9546 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.679495936+00:00
>> I1228 13:46:44.679622 9546 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/jquery-1.7.1.min.js'
>> I1228 13:46:44.679975 9563 process.cpp:2677] Resuming __http__(51)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.679950080+00:00
>> I1228 13:46:44.680107 9563 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/jquery-1.7.1.min.js' with
>> length 93868
>> I1228 13:46:44.700376 9553 process.cpp:2677] Resuming __http__(17)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.700357888+00:00
>> I1228 13:46:44.700428 9541 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.700403968+00:00
>> I1228 13:46:44.700503 9541 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/underscore-1.4.3.min.js'
>> I1228 13:46:44.700806 9568 process.cpp:2677] Resuming __http__(17)@
>> 172.16.1.152:5050 <http://172.16.1.152:5050/> at 2016-12-28
>> 21:46:44.700773120+00:00
>> I1228 13:46:44.700948 9568 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/underscore-1.4.3.min.js'
>> with length 13432
>> I1228 13:46:44.701045 9552 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.701024000+00:00
>> I1228 13:46:44.701051 9542 process.cpp:2677] Resuming master@:5050
>> <http://master@172.16.1.152:5050/> at 2016-12-28 21:46:44.701029888+00:00
>> I1228 13:46:44.701148 9542 process.cpp:3323] Handling HTTP event for
>> process 'master' with path: '/master/static/js/zeroclipboard-1.1.7.js'
>> I1228 13:46:44.701447 9569 process.cpp:2677] Resuming __http__(49)@:5050
>> <http://172.16.1.152:5050/> at 2016-12-28 21:46:44.701420032+00:00
>> I1228 13:46:44.701582 9569 process.cpp:1280] Sending file at
>> '/usr/local/share/mesos/webui/master/static/js/zeroclipboard-1.1.7.js'
>> with length 16965
>>
>>
>
>
> On Wed, Dec 28, 2016 at 1:20 PM, Ji Yan <jiyan@drive.ai> wrote:
>
>> These are some more logs that came after
>>
>> I1228 13:14:28.959544 9569 hierarchical.cpp:271] Added framework
>>> 993198d1-7393-4656-9f75-4f22702609d0-0007
>>> I1228 13:14:28.959837 9569 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:28.959900 9569 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:28.960018 9569 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 414146ns
>>> I1228 13:14:29.012315 9548 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:29.012393 9548 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:29.012488 9548 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 521825ns
>>> I1228 13:14:30.013526 9567 hierarchical.cpp:1537] No allocations
>>> performed
>>> I1228 13:14:30.013653 9567 hierarchical.cpp:1632] No inverse offers to
>>> send out!
>>> I1228 13:14:30.013790 9567 hierarchical.cpp:1172] Performed allocation
>>> for 1 agents in 647664ns
>>
>>
>> On Wed, Dec 28, 2016 at 12:46 PM, Ji Yan <jiyan@drive.ai> wrote:
>>
>>> I have set this environment variable before restarting mesos master
>>>
>>> export GLOG_v=1
>>>
>>> Ji
>>>
>>> On Wed, Dec 28, 2016 at 12:43 PM, Ji Yan <jiyan@drive.ai> wrote:
>>>
>>>> Thanks Timothy, I've added you on chat. Also these are the logs I get
>>>> from master
>>>>
>>>> I1228 12:40:49.213546 9544 master.cpp:2424] Received SUBSCRIBE call for
>>>>> framework 'SimpleApp' at
>>>>> scheduler-981ef901-4bb1-424d-a46b-520afe191caa@:45540
>>>>> <http://scheduler-981ef901-4bb1-424d-a46b-520afe191caa@172.16.1.101:45540/>
>>>>> I1228 12:40:49.214092 9544 master.cpp:2500] Subscribing framework
>>>>> SimpleApp with checkpointing disabled and capabilities [ ]
>>>>> I1228 12:40:49.217025 9551 hierarchical.cpp:271] Added framework
>>>>> 993198d1-7393-4656-9f75-4f22702609d0-0000
>>>>
>>>>
>>>> Thanks
>>>> Ji
>>>>
>>>> On Wed, Dec 28, 2016 at 12:33 PM, Timothy Chen <tnachen@gmail.com>
>>>> wrote:
>>>>
>>>>> Btw if it's easier you can add me on Google hangout chat.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Wed, Dec 28, 2016 at 12:19 PM, Timothy Chen <tnachen@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Can you enable verbose logging (GLOG_v=1) on the master and look at
>>>>>> what the master log says when the framework is registered?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Wed, Dec 28, 2016 at 12:02 PM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>
>>>>>>> It is empty, no resource has been allocated
>>>>>>>
>>>>>>>
>>>>>>> On Dec 28, 2016, at 11:54 AM, Timothy Chen <tnachen@gmail.com> wrote:
>>>>>>>
>>>>>>> And what does the offer coming from the Mesos master look like?
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:37 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>>
>>>>>>> Thanks Timothy, we had this set in the Spark configuration. The cluster
>>>>>>> has one node with two GPU cores.
>>>>>>>
>>>>>>> conf.set('spark.mesos.gpus.max', '1')
>>>>>>>
>>>>>>> The test app we are trying to run now is very simple
>>>>>>>
>>>>>>>
>>>>>>> from pyspark import SparkContext, SparkConf
>>>>>>> conf = SparkConf()
>>>>>>> conf.set('spark.mesos.executor.docker.image',
>>>>>>>          'docker.drive.ai/spark_gpu_experiment:latest')
>>>>>>> conf.set('spark.mesos.executor.docker.volumes',
>>>>>>>          '/cronut:/cronut:ro,'
>>>>>>>          'spark-2.1.0-bin-spark-2.1-rc5-mesos:/spark-2.1.0-bin-spark-2.1-rc5-mesos')
>>>>>>> conf.set('spark.mesos.gpus.max', '1')
>>>>>>> sc = SparkContext(conf=conf, appName="SimpleApp")
>>>>>>> sc.setLogLevel('ALL')
>>>>>>> logFile = "/fig/home/jackie/spark_play/test.in"
>>>>>>> logData = sc.textFile(logFile, minPartitions=10).cache()
>>>>>>> logData.count()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> and this is the shell command to launch it
>>>>>>>
>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/fig/home/jackie/spark_play/libmesos.so
>>>>>>> # this is inside the container
>>>>>>> export SPARK_EXECUTOR_URI=/fig/home/jiyan/spark-2.1.0-bin-spark-2.1-rc5-mesos.tgz
>>>>>>> spark-2.1.0-bin-spark-2.1-rc5-mesos/bin/spark-submit \
>>>>>>> --master 'mesos://mesos_master_dev:5050' \
>>>>>>> -v SimpleApp.py
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Maybe you would be able to spot something wrong in this setup
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ji
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:25 AM, Timothy Chen <tnachen@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Ji,
>>>>>>>
>>>>>>> Did you specify GPU resources (spark.mesos.gpus.max) when you launched
>>>>>>> Spark? The current design of GPU resources in Mesos is that it tries
>>>>>>> not to offer resources with GPUs to frameworks that don't ask for GPU
>>>>>>> resources.
>>>>>>>
>>>>>>> Tim
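One way to check the point above from the master side is to look at the capabilities the framework registered with. Mesos sends GPU offers only to frameworks that opt in through the GPU_RESOURCES framework capability, and the master's "Subscribing framework" log line (quoted elsewhere in this thread) lists the capabilities it received. The sketch below parses that line. The function names are hypothetical, the regular expression is modelled only on the single log line quoted in this thread, and the format of a non-empty capability list is an assumption.

```python
import re
from typing import List, Optional

# Modelled on the master log line quoted in this thread, e.g.
#   "... Subscribing framework SimpleApp with checkpointing disabled
#    and capabilities [ ]"
_SUBSCRIBE_RE = re.compile(
    r"Subscribing framework\s+(?P<name>.+?)\s+with checkpointing \w+ "
    r"and capabilities \[(?P<caps>[^\]]*)\]"
)


def framework_capabilities(log_line: str) -> Optional[List[str]]:
    """Return the capabilities listed on a 'Subscribing framework' line.

    Returns None when the line is not a subscription line at all.
    """
    match = _SUBSCRIBE_RE.search(log_line)
    if match is None:
        return None
    # Assumption: capabilities are separated by commas and/or whitespace.
    return [c for c in re.split(r"[,\s]+", match.group("caps")) if c]


def opted_in_to_gpus(log_line: str) -> bool:
    """True if the framework on this line registered with GPU_RESOURCES."""
    caps = framework_capabilities(log_line)
    return caps is not None and "GPU_RESOURCES" in caps


if __name__ == "__main__":
    line = (
        "I1228 12:40:49.214092 9544 master.cpp:2500] Subscribing framework "
        "SimpleApp with checkpointing disabled and capabilities [ ]"
    )
    print(framework_capabilities(line), opted_in_to_gpus(line))
```

An empty capability list, as in the log quoted in this thread, would be consistent with the GPU setting never having reached Spark.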
>>>>>>>
>>>>>>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>>>>>
>>>>>>> Dear Spark Users,
>>>>>>>
>>>>>>> Has anyone had successful experience running Spark on Mesos with GPU
>>>>>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>>>>>> resources. With Spark, it seems that the GPU support with Mesos
>>>>>>> (https://github.com/apache/spark/pull/14644) has only recently been
>>>>>>> merged into Spark Master which is not found in 2.0.2 release yet. We have
>>>>>>> a custom built Spark from 2.1-rc5 which is confirmed to have the above
>>>>>>> change. However when we try to run any code from Spark on this Mesos
>>>>>>> setup, the spark program hangs and keeps saying
>>>>>>>
>>>>>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>>>>>> check your cluster UI to ensure that workers are registered and have
>>>>>>> sufficient resources”
>>>>>>>
>>>>>>> We are pretty sure that the cluster has enough resources as there is
>>>>>>> nothing running on it. If we disable the GPU support in configuration and
>>>>>>> restart mesos and retry the same program, it would work.
>>>>>>>
>>>>>>> Any comment/advice on this greatly appreciated
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ji
>>>>>>>
>>>>>>>
>>>>>>> The information in this email is confidential and may be legally
>>>>>>> privileged. It is intended solely for the addressee. Access to this email
>>>>>>> by anyone else is unauthorized. If you are not the intended recipient, any
>>>>>>> disclosure, copying, distribution or any action taken or omitted to be
>>>>>>> taken in reliance on it, is prohibited and may be unlawful.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
On Fri, Dec 30, 2016 at 11:06 AM, Stephen Boesch <javadba@gmail.com> wrote:
> Would it be possible to share that communication? I am interested in this
> thread.
>
> 2016-12-30 11:02 GMT-08:00 Ji Yan <jiyan@drive.ai>:
>
>> Thanks Michael. Tim and I have touched base and thankfully the issue has
>> already been resolved.
>>
>> On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt <mgummelt@mesosphere.io>
>> wrote:
>>
>>> I've cc'd Tim and Kevin, who worked on GPU support.
>>>
>>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan <jiyan@drive.ai> wrote:
>>>
>>>> Dear Spark Users,
>>>>
>>>> Has anyone had successful experience running Spark on Mesos with GPU
>>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>>> resources. With Spark, it seems that the GPU support with Mesos (
>>>> https://github.com/apache/spark/pull/14644) has only recently been
>>>> merged into Spark Master which is not found in 2.0.2 release yet. We have a
>>>> custom built Spark from 2.1-rc5 which is confirmed to have the above
>>>> change. However when we try to run any code from Spark on this Mesos setup,
>>>> the spark program hangs and keeps saying
>>>>
>>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>>> check your cluster UI to ensure that workers are registered and have
>>>> sufficient resources”
>>>>
>>>> We are pretty sure that the cluster has enough resources as there is
>>>> nothing running on it. If we disable the GPU support in configuration and
>>>> restart mesos and retry the same program, it would work.
>>>>
>>>> Any comment/advice on this greatly appreciated
>>>>
>>>> Thanks,
>>>> Ji
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>