spark-user mailing list archives

From Mosharaf Chowdhury <mosharafka...@gmail.com>
Subject Re: Problems with broadcast large datastructure
Date Mon, 13 Jan 2014 04:38:45 GMT
400 MB isn't really that big. Broadcast is expected to work with several GB
of data, even on larger clusters (100s of machines).

If you are using the default HttpBroadcast, then Akka isn't used to move
the broadcast data. But the block manager can run out of memory if you
repeatedly broadcast large objects. Another scenario is that the master
isn't receiving any heartbeats from the BlockManager because the control
messages are getting dropped due to bulk data movement. Can you provide a
bit more detail on your network setup?

Also, you can try doing a binary search over the size of the broadcast data
to see at what size it breaks (i.e., try to broadcast 10 MB, then 20, then
40, and so on). Also, limit each run to a single iteration in the example
(right now, it tries to broadcast 3 consecutive times).
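
A minimal sketch of such a size sweep (a standalone Scala app; the master
URL, partition count, and sizes are illustrative):

    import org.apache.spark.SparkContext

    object BroadcastSizeSweep {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://master:7077", "BroadcastSizeSweep")
        // Double the payload until the broadcast breaks: 10 MB, 20 MB, 40 MB, ...
        for (mb <- Seq(10, 20, 40, 80, 160, 320)) {
          val payload = new Array[Byte](mb * 1024 * 1024)
          val bc = sc.broadcast(payload)
          // Force every task to actually fetch the broadcast value on the executors.
          val tasks = sc.parallelize(1 to 100, 100).map(_ => bc.value.length).count()
          println("broadcast of " + mb + " MB reached " + tasks + " tasks")
        }
        sc.stop()
      }
    }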

If you are using a newer branch, you can also try the new TorrentBroadcast
implementation.
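
The factory is selected through the spark.broadcast.factory property; a
sketch, assuming the class name used on the 0.8/0.9 branches (worth
verifying against your build):

    // Must be set before the SparkContext is created.
    System.setProperty("spark.broadcast.factory",
      "org.apache.spark.broadcast.TorrentBroadcastFactory")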


--
Mosharaf Chowdhury
http://www.mosharaf.com/


On Sun, Jan 12, 2014 at 8:22 PM, Aureliano Buendia <buendia360@gmail.com> wrote:

>
> On Mon, Jan 13, 2014 at 4:17 AM, lihu <lihu723@gmail.com> wrote:
>
>> I have encountered the same problem as you.
>> I have a cluster of 20 machines, and I just ran the broadcast example; all I
>> did was change the data size in the example to 400 MB, which is really a
>> small data size.
>>
>
> Is 400 MB a really small size for broadcasting?
>
> I had the impression that broadcast is for objects much, much smaller,
> less than about 10 MB.
>
>
>> But I encountered the same problem as you.
>> *So I wonder whether broadcast capacity is a weak point in the Spark system?*
>>
>>
>> Here is my config:
>>
>> SPARK_MEM=12g
>> SPARK_MASTER_WEBUI_PORT=12306
>> SPARK_WORKER_MEMORY=12g
>> SPARK_JAVA_OPTS+="-Dspark.executor.memory=8g -Dspark.akka.timeout=600
>> -Dspark.local.dir=/disk3/lee/tmp -Dspark.worker.timeout=600
>> -Dspark.akka.frameSize=10000 -Dspark.akka.askTimeout=300
>> -Dspark.storage.blockManagerTimeoutIntervalMs=100000
>> -Dspark.akka.retry.wait=600 -Dspark.blockManagerHeartBeatMs=80000 -Xms15G
>> -Xmx15G -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
>>
>> On Sat, Jan 11, 2014 at 8:27 AM, Khanderao kand <khanderao.kand@gmail.com> wrote:
>>
>>> If your object size is > 10 MB, you may need to change spark.akka.frameSize.
>>>
>>> What is your spark.akka.timeout?
>>>
>>> Did you change spark.akka.heartbeat.interval?
>>>
>>> BTW, given the large size being broadcast across 25 nodes, you may want to
>>> consider the frequency of such transfers and evaluate alternative patterns.
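
A sketch of raising those settings, assuming they are passed as Java system
properties before the SparkContext is created, as was typical at the time
(the values are illustrative, not recommendations):

    // Illustrative values only; tune for your cluster and check the units
    // against the docs for your Spark version.
    System.setProperty("spark.akka.frameSize", "512")            // MB
    System.setProperty("spark.akka.timeout", "300")              // seconds
    System.setProperty("spark.akka.heartbeat.interval", "10000")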
>>>
>>> On Tue, Jan 7, 2014 at 12:55 AM, Sebastian Schelter <ssc@apache.org> wrote:
>>>
>>>> Spark repeatedly fails to broadcast a large object on a cluster of 25
>>>> machines for me.
>>>>
>>>> I get log messages like this:
>>>>
>>>> [spark-akka.actor.default-dispatcher-4] WARN
>>>> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
>>>> BlockManagerId(3, cloud-33.dima.tu-berlin.de, 42185, 0) with no recent
>>>> heart beats: 134689ms exceeds 45000ms
>>>>
>>>> Is there something wrong with my config? Do I have to increase some
>>>> timeout?
>>>>
>>>> Thx,
>>>> Sebastian
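
For the "no recent heart beats" warning specifically, a sketch of raising
the BlockManager timeout, reusing the property that appears in lihu's config
above (whether this property governs the 45000 ms default shown in the log
is worth verifying against your Spark version; the value is illustrative):

    // Allow BlockManagers to go longer without a heartbeat before removal.
    // Must be set before the SparkContext is created.
    System.setProperty("spark.storage.blockManagerTimeoutIntervalMs", "300000")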
