spark-dev mailing list archives

From Bin Wang <wbi...@gmail.com>
Subject Re: Optimize the first map reduce of DStream
Date Tue, 24 Mar 2015 14:03:52 GMT
I'm not looking to limit the block size.

Here is another example. Say we want to count the lines from the stream in
one hour. In a normal program, we might write it like this:

int sum = 0;
String line;
while ((line = getFromStream()) != null) {
    store(line); // persist the line to storage instead of keeping it in memory
    sum++;
}

This could be seen as a reduce. The only memory used here is the single
variable named "line"; there is no need to keep all the lines in memory
(as long as the lines are not used elsewhere). For fault tolerance, we
can store the lines in persistent storage instead of in memory. Could
Spark Streaming work this way? Does Flink work like this?
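
To make the pattern concrete, here is a minimal, self-contained Java sketch
of it. The stream and the storage are simulated in memory purely for
illustration; getFromStream and store are hypothetical stand-ins, not Spark
or Flink APIs:

import java.util.ArrayDeque;
import java.util.Queue;

public class StreamingCount {
    // Simulated input stream: poll() returns the next line, or null when drained.
    private final Queue<String> stream = new ArrayDeque<>();
    // Simulated durable storage, standing in for disk/HDFS writes.
    private final Queue<String> storage = new ArrayDeque<>();

    StreamingCount(String... lines) {
        for (String l : lines) stream.add(l);
    }

    private String getFromStream() {
        return stream.poll(); // null once the stream is exhausted
    }

    private void store(String line) {
        storage.add(line); // a real system would append to durable storage here
    }

    // Count lines one at a time: only "line" and "sum" live in memory,
    // regardless of how many lines pass through.
    public long countLines() {
        long sum = 0;
        String line;
        while ((line = getFromStream()) != null) {
            store(line);
            sum++;
        }
        return sum;
    }

    public static void main(String[] args) {
        StreamingCount c = new StreamingCount("a", "b", "c");
        System.out.println(c.countLines()); // prints 3
    }
}

The point of the sketch is that memory use is constant in the number of
lines: each line is persisted and counted, then dropped.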





On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara <zoltan.zvara@gmail.com> wrote:

> There is a BlockGenerator on each worker node next to the
> ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer in
> each interval (block_interval). These Blocks are passed to
> ReceiverSupervisorImpl, which puts these blocks into the BlockManager
> for storage. BlockInfos are passed to the driver. Mini-batches are created
> by the JobGenerator component on the driver each batch_interval. I guess
> what you are looking for is provided by a continuous model like Flink's. We
> are creating mini-batches to provide fault tolerance.
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zvara@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-00000021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)
>
> 2015-03-24 11:55 GMT+01:00 Arush Kharbanda <arush@sigmoidanalytics.com>:
>
>> The block interval is configurable, so I think you can reduce it to keep
>> each block in memory only for a limited interval. Is that what you are
>> looking for?
>>
>> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wbin00@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I'm learning Spark, and I think the current streaming implementation
>> > could be optimized. Correct me if I'm wrong.
>> >
>> > The current implementation puts the data of one batch into memory (as
>> > an RDD). But that seems unnecessary.
>> >
>> > For example, if I want to count the lines which contain the word
>> > "Spark", I just need to map every line to see if it contains the word,
>> > then reduce with a sum function. After that, there is no reason to
>> > keep the line in memory.
>> >
>> > That is to say, if the DStream only has one map and/or reduce
>> > operation on it, it is not necessary to keep all the batch data in
>> > memory. Something like a pipeline should work.
>> >
>> > Is it difficult to implement this on top of the current
>> > implementation?
>> >
>> > Thanks.
>> >
>> > ---
>> > Bin Wang
>> >
>>
>>
>>
>> --
>>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
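
Both intervals Zoltán mentions are configurable. As a rough sketch against
the Spark Streaming Java API (the property name is real, but double-check
the accepted value format and units for your Spark version; older releases
may expect a plain millisecond number such as "200" instead of "200ms"):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
    .setAppName("interval-demo")
    // block_interval: how often the receiver's BlockGenerator cuts a new block
    .set("spark.streaming.blockInterval", "200ms");

// batch_interval: how often the JobGenerator turns buffered blocks
// into a mini-batch (an RDD) on the driver
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

This is a configuration fragment, not a runnable job; it only shows where
the two intervals are set.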
