spark-dev mailing list archives

From Zoltán Zvara <zoltan.zv...@gmail.com>
Subject Re: Spark remote communication pattern
Date Fri, 10 Apr 2015 11:15:08 GMT
Thank you for the hint!

I've found an HTTP-based and a torrent-based broadcast implementation. It
seems that TorrentBroadcast is the one in use. Was the HTTP one implemented
earlier? As far as I can see, broadcasting is triggered by user code through
SparkContext. What other events can trigger a TorrentBroadcast?
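
To check my understanding of the hint that torrent broadcast data are read
through the block manager, here is a toy sketch. It is plain Python, not Spark
code: ToyBlockManager, write_pieces, read_pieces, and the block id format are
all hypothetical names made up for illustration, and the assumption that the
payload is split into fixed-size pieces is mine. It only shows the shape of
the idea: the writer stores pieces as blocks, and the reader reassembles the
value by asking the block manager for each piece.

```python
# Toy model only: plain Python, not Spark code. All names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyBlockManager:
    """Stands in for a block store keyed by block id."""
    blocks: Dict[str, bytes] = field(default_factory=dict)

    def put(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def get(self, block_id: str) -> bytes:
        return self.blocks[block_id]


def write_pieces(bm: ToyBlockManager, broadcast_id: int,
                 payload: bytes, piece_size: int) -> int:
    """Split the payload into fixed-size pieces and store each as a block.

    Returns the number of pieces written.
    """
    pieces = [payload[i:i + piece_size]
              for i in range(0, len(payload), piece_size)]
    for index, piece in enumerate(pieces):
        bm.put(f"broadcast_{broadcast_id}_piece{index}", piece)
    return len(pieces)


def read_pieces(bm: ToyBlockManager, broadcast_id: int,
                num_pieces: int) -> bytes:
    """Fetch every piece through the block manager and join them in order."""
    return b"".join(bm.get(f"broadcast_{broadcast_id}_piece{i}")
                    for i in range(num_pieces))
```

In this sketch the reader never talks to the writer directly; every piece goes
through the block store, which is the part I would like to instrument.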

Zvara Zoltán



mail, hangout, skype: zoltan.zvara@gmail.com

mobile, viber: +36203129543

bank: 10918001-00000021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-04-09 19:04 GMT+02:00 Reynold Xin <rxin@databricks.com>:

> For torrent broadcast, data are read directly through the block manager:
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167
>
>
>
> On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara <zoltan.zvara@gmail.com>
> wrote:
>
>> Thanks! I've found the fetcher! Are there any other places and cases where
>> blocks travel over the network?
>>
>> Zvara Zoltán
>>
>> 2015-04-09 10:24 GMT+02:00 Reynold Xin <rxin@databricks.com>:
>>
>>> Take a look at the following two files:
>>>
>>>
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
>>>
>>> and
>>>
>>>
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
>>>
>>> On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara <zoltan.zvara@gmail.com>
>>> wrote:
>>>
>>>> Dear Developers,
>>>>
>>>> I'm trying to investigate the data-flow communication pattern during the
>>>> execution of a Spark program defined by an RDD chain. I'm investigating
>>>> from the Task point of view, and found that ResultTask, when it retrieves
>>>> the iterator for its RDD for a given partition, effectively asks the
>>>> BlockManager to get the block from a local or remote location. What I do
>>>> there is include the actual location data in BlockResult, so the task can
>>>> tell where it retrieved the data from. As far as I can tell, this is the
>>>> only case in which ResultTask can trigger a data flow.
>>>>
>>>> What is the case with ShuffleMapTask? What happens there? I'm trying to
>>>> log the locations that are involved in the shuffle process. I would be
>>>> happy to receive a few hints regarding where remote communication is
>>>> managed in the case of ShuffleMapTask.
>>>>
>>>> Thanks!
>>>>
>>>> Zoltán
>>>>
>>>
>>>
>>
>
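
For reference, the location tagging described in my original message can be
sketched roughly as below. This is a plain Python toy model, not Spark code:
ToyBlockResult, ToyNode, the host names, and the block ids are hypothetical
and exist only for illustration. The point is just that the result of a block
lookup carries where the data came from, so the caller can log it.

```python
# Toy model only: plain Python, not Spark code. All names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ToyBlockResult:
    data: bytes
    location: str  # host that actually served the block
    remote: bool   # True if the block came from a peer, not the local store


class ToyNode:
    """A node with a local block store and a list of peers it can ask."""

    def __init__(self, host: str, peers: Optional[List["ToyNode"]] = None):
        self.host = host
        self.store: Dict[str, bytes] = {}
        self.peers: List["ToyNode"] = peers if peers is not None else []

    def get(self, block_id: str) -> Optional[ToyBlockResult]:
        # Try the local store first and tag the result with this host.
        if block_id in self.store:
            return ToyBlockResult(self.store[block_id], self.host, remote=False)
        # Otherwise ask each peer and tag the result with the serving host.
        for peer in self.peers:
            if block_id in peer.store:
                return ToyBlockResult(peer.store[block_id], peer.host, remote=True)
        return None
```

A task-side caller could then log result.location and result.remote for every
block it reads, which is the information I am after for the shuffle case too.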
