spark-dev mailing list archives

From Zoltán Zvara <zoltan.zv...@gmail.com>
Subject Re: Optimize the first map reduce of DStream
Date Tue, 24 Mar 2015 14:21:29 GMT
AFAIK Spark Streaming cannot work that way. Transformations are defined on
DStreams, and a DStream basically holds (time, allocatedBlocksForBatch)
pairs. Blocks are allocated to batches by the JobGenerator; unallocated
block infos are collected by the ReceivedBlockTracker. In Spark Streaming
you define transformations and actions on DStreams; the operators define
RDD chains, and tasks are created by spark-core. You manipulate DStreams,
not single units of data. Flink, for example, uses a continuous model,
which optimizes for memory usage and latency. See the Spark Streaming paper
and the Spark paper for more detail.
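[Editor's note: the mini-batch model described above can be illustrated without Spark at all. Below is a minimal Python sketch — the names and the count-based batching are illustrative, not Spark's actual time-based API — showing how a stream is grouped into per-batch collections before the map/reduce chain runs, the way a DStream turns each batch interval's data into an RDD.]

```python
from itertools import islice

def batches(stream, batch_size):
    """Group a record stream into fixed-size mini-batches.
    (Spark batches by time interval; we batch by count for simplicity.)"""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # the whole batch is materialized, like an RDD's data

lines = ["spark rocks", "hello", "spark streaming", "flink"]

# Transformations are declared once, then executed batch by batch:
counts = [sum(1 for line in batch if "spark" in line)
          for batch in batches(lines, 2)]
```

Each element of `counts` is the result of one mini-batch job, which is exactly why the batch's data must exist as a collection (an RDD) rather than being consumed record by record.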

Zvara Zoltán



mail, hangout, skype: zoltan.zvara@gmail.com

mobile, viber: +36203129543

bank: 10918001-00000021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-03-24 15:03 GMT+01:00 Bin Wang <wbin00@gmail.com>:

> I'm not looking to limit the block size.
>
> Here is another example. Say we want to count the lines from the stream in
> one hour. In a normal program, we might write it like this:
>
> int sum = 0;
> while ((line = getFromStream()) != NULL) {
>     store(line); // persist the line to storage instead of memory
>     sum++;
> }
>
> This could be seen as a reduce. The only memory used here is the variable
> "line"; we don't need to keep all the lines in memory (as long as the
> lines aren't used elsewhere). If we want fault tolerance, we can just
> store the lines in durable storage instead of memory. Could Spark
> Streaming work this way? Does Flink work like this?
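[Editor's note: the constant-memory loop quoted above translates directly into a runnable Python sketch; `get_from_stream` and `store` are hypothetical stand-ins for a real stream source and durable storage.]

```python
def get_from_stream():
    """Stand-in for a real stream source."""
    yield from ["a", "b", "c"]

stored = []

def store(line):
    """Stand-in for writing the line to durable storage."""
    stored.append(line)

# Incremental reduce: only `line` and `total` are held in memory
# at any moment; no batch of lines is ever materialized.
total = 0
for line in get_from_stream():
    store(line)   # persist instead of buffering in memory
    total += 1
```

This is the per-record pipelined model being asked about, as opposed to Spark Streaming's per-batch model.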
>
>
>
>
>
> On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara <zoltan.zvara@gmail.com>
> wrote:
>
>> There is a BlockGenerator on each worker node next to the
>> ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer at
>> each block interval. These Blocks are passed to the
>> ReceiverSupervisorImpl, which puts them into the BlockManager for
>> storage; BlockInfos are passed to the driver. Mini-batches are created by
>> the JobGenerator component on the driver every batch interval. I guess
>> what you are looking for is provided by a continuous model like Flink's.
>> We create mini-batches to provide fault tolerance.
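[Editor's note: the two intervals described above can be simulated in a few lines of plain Python. This is a toy model — the interval values are made up and the dict-based grouping is not Spark's implementation — showing records cut into blocks every block interval and blocks grouped into a mini-batch every batch interval.]

```python
# Toy timeline: one record arrives every 50 ms for one second.
block_interval = 200   # ms, the BlockGenerator's cadence
batch_interval = 1000  # ms, the DStream's batch duration

records = [(t, f"rec-{t}") for t in range(0, 1000, 50)]

# BlockGenerator: cut the record stream into blocks by block interval.
blocks = {}
for t, rec in records:
    blocks.setdefault(t // block_interval, []).append(rec)

# JobGenerator: group the blocks into one mini-batch per batch interval.
mini_batches = {}
for block_id, recs in blocks.items():
    batch_id = block_id * block_interval // batch_interval
    mini_batches.setdefault(batch_id, []).append(recs)
```

With these numbers, twenty records become five blocks, and all five blocks land in a single mini-batch whose job the driver then schedules.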
>>
>>
>> 2015-03-24 11:55 GMT+01:00 Arush Kharbanda <arush@sigmoidanalytics.com>:
>>
>>> The block interval is configurable, so I think you can reduce it to
>>> keep each block in memory only for that limited interval. Is that what
>>> you are looking for?
>>>
>>> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wbin00@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm learning Spark, and I think the current streaming implementation
>>> > could be optimized. Correct me if I'm wrong.
>>> >
>>> > The current streaming implementation puts the data of one batch into
>>> > memory (as an RDD), but that seems unnecessary.
>>> >
>>> > For example, if I want to count the lines that contain the word
>>> > "Spark", I just need to map over every line to see if it contains the
>>> > word, then reduce with a sum function. After that, the line no longer
>>> > needs to be kept in memory.
>>> >
>>> > That is to say, if the DStream only has one map and/or reduce
>>> > operation on it, it is not necessary to keep all the batch data in
>>> > memory. Something like a pipeline should work.
>>> >
>>> > Is it difficult to implement this on top of the current implementation?
>>> >
>>> > Thanks.
>>> >
>>> > ---
>>> > Bin Wang
>>> >
>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Arush Kharbanda* || Technical Teamlead
>>>
>>> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>>>
>>
