spark-user mailing list archives

From Swapnil Shinde <swapnilushi...@gmail.com>
Subject Re: Spark driver locality
Date Fri, 28 Aug 2015 11:59:50 GMT
Thanks..
On Aug 28, 2015 4:55 AM, "Rishitesh Mishra" <rishi80.mishra@gmail.com>
wrote:

> Hi Swapnil,
>
> 1. All task scheduling and retries happen from the driver, so you are right
> that there is a lot of communication between the driver and the cluster. It
> depends on how you want to set up your Spark application: whether it has
> direct access to the Spark cluster or is routed through a gateway machine.
> You can decide accordingly.
>
> 2. I am not familiar with NFS-layer concurrency, but I think parallel reads
> should be OK. Someone who knows how NFS works should correct me if I am
> wrong.
>
>
> On Fri, Aug 28, 2015 at 1:12 AM, Swapnil Shinde <swapnilushinde@gmail.com>
> wrote:
>
>> Thanks Rishitesh !!
>> 1. I get that the driver doesn't need to be on the master, but there is a
>> lot of communication between the driver and the cluster. That's why a
>> co-located gateway was recommended. How big is the impact of the driver not
>> being co-located with the cluster?
>>
>> 4. How do HDFS splits get assigned to worker nodes to read data from a
>> remote Hadoop cluster? I am more interested in how the MapR NFS layer is
>> accessed in parallel.
>>
>> -
>> Swapnil
>>
>>
>> On Thu, Aug 27, 2015 at 2:53 PM, Rishitesh Mishra <
>> rishi80.mishra@gmail.com> wrote:
>>
>>> Hi Swapnil,
>>> Let me try to answer some of the questions. Answers inline. Hope it
>>> helps.
>>>
>>> On Thursday, August 27, 2015, Swapnil Shinde <swapnilushinde@gmail.com>
>>> wrote:
>>>
>>>> Hello
>>>> I am new to the Spark world and recently started exploring it in
>>>> standalone mode. It would be great to get clarification on the doubts
>>>> below.
>>>>
>>>> 1. Driver locality - The documentation mentions that "client"
>>>> deploy-mode is not good if the machine running "spark-submit" is not
>>>> co-located with the worker machines. Cluster mode is not available with
>>>> standalone clusters. Therefore, do we have to submit all applications on
>>>> the master machine? (Assuming we don't have a separate co-located gateway
>>>> machine.)
>>>>
>>>
>>> No. In standalone mode, too, your master and driver machines can be
>>> different. The driver should have access to the master as well as the
>>> worker machines.
>>>
>>>> 2. How does the above driver locality work with the Spark shell running
>>>> on a local machine?
>>>>
>>>
>>> The Spark shell itself acts as the driver. This means your local machine
>>> should have access to all the cluster machines.
>>>
>>>>
>>>> 3. I am a little confused about the role of the driver program. Does the
>>>> driver do any computation in the Spark app life cycle? For instance, in a
>>>> simple row count app, worker nodes calculate local row counts. Does the
>>>> driver sum up the local row counts? In short, where does the reduce phase
>>>> run in this case?
>>>>
>>>
>>> The role of the driver is to coordinate with the cluster manager for
>>> initial resource allocation. After that, it schedules tasks on the
>>> executors assigned to it. It does not do any computation (unless the
>>> application itself does something on its own). The reduce phase is also a
>>> bunch of tasks, which get assigned to one or more executors.
>>>
>>>>
>>>> 4. When accessing HDFS data over the network, do worker nodes read
>>>> data in parallel? How does HDFS data over the network get accessed in a
>>>> Spark application?
>>>>
>>>
>>> Yes. Each worker gets a split to read, and the workers read their own
>>> splits in parallel. This means all worker nodes should have access to the
>>> Hadoop file system.
>>>
>>>
>>>> Sorry if these questions were already discussed..
>>>>
>>>> Thanks
>>>> Swapnil
>>>>
>>>
>>
>
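
The row-count question in point 3 of the thread can be illustrated with a small plain-Python simulation. This is only a sketch and does not use Spark: the thread pool stands in for executors, and the names split_into_partitions, count_partition, and driver_count are hypothetical. It assumes the pattern asked about in the question, where each task counts its own partition and the driver only adds up the small partial counts; as far as I know, Spark's RDD count() follows this pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Hypothetical helper: deal rows into num_partitions roughly equal slices."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_partition(partition):
    """Task run on an 'executor': count the rows of one partition."""
    return sum(1 for _ in partition)


def driver_count(rows, num_partitions=4):
    """'Driver' side: schedule one task per partition, then add the partial counts."""
    partitions = split_into_partitions(rows, num_partitions)
    # The thread pool is a stand-in for executors running tasks in parallel.
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial_counts = list(executors.map(count_partition, partitions))
    # Only these small integers reach the driver, never the rows themselves.
    return partial_counts, sum(partial_counts)


if __name__ == "__main__":
    partials, total = driver_count(list(range(10)), num_partitions=3)
    print("partial counts:", partials, "total:", total)
```

The heavy work (scanning rows) stays in the tasks; the final addition on the driver is cheap, which fits the description of the driver as mainly a scheduler.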

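For point 4 (how splits reach workers), here is a minimal plain-Python sketch, again not Spark or Hadoop code. It assumes a file is cut into fixed-size byte ranges and that the splits are handed out round-robin; compute_splits and assign_splits are hypothetical names. A real scheduler also weighs data locality and free executor slots, which this sketch ignores (with a remote Hadoop cluster or an NFS mount there is typically no local copy to prefer anyway).

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) byte ranges that cover a file of file_size bytes."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


def assign_splits(splits, workers):
    """Round-robin assignment of splits to workers (an assumption; locality is ignored)."""
    assignment = {worker: [] for worker in workers}
    for index, split in enumerate(splits):
        assignment[workers[index % len(workers)]].append(split)
    return assignment


if __name__ == "__main__":
    splits = compute_splits(file_size=300, split_size=128)
    for worker, ranges in assign_splits(splits, ["worker-1", "worker-2"]).items():
        print(worker, ranges)
```

Each worker then opens the file itself and reads only its own byte ranges, which is why every worker needs access to the file system.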