spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: spark streaming with checkpoint
Date Thu, 22 Jan 2015 16:59:59 GMT
Maybe this is the wrong approach. Try something like HyperLogLog or bitmap
structures, which you can find, for instance, in Redis. They are much
smaller than keeping the raw data for the whole window.
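
To illustrate why such a structure is smaller, here is a minimal HyperLogLog sketch in plain Python. It is not Redis's implementation and not something from this thread: the class name, the choice of SHA-1 as the hash, and the parameter p=14 are assumptions made for the example. It keeps 2**14 one-byte registers (16 KB) no matter how many users are added, and returns an approximate count.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p one-byte registers.

    Illustrative only: hash function and p are assumptions for this sketch.
    """

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # all start at 0

    def add(self, item):
        # Derive a 64-bit hash from the item.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                  # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add("user-%d" % i)
    hll.add("user-%d" % i)   # duplicates do not change the estimate
print(round(hll.count()))
```

Adding the same user twice leaves the registers unchanged, so a structure like this could be updated per batch instead of holding 24 hours of records in a window. The result is an estimate, not an exact count.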
On 22 Jan 2015 17:19, "Balakrishnan Narendran" <balu.naren@gmail.com>
wrote:

> Thank you Jerry,
>        Does the window operation create new RDDs for each slide
> duration? I am asking because I see a constant increase in memory
> even when no logs are received.
> If not checkpointing, is there any alternative that you would suggest?
>
>
> On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai <saisai.shao@intel.com>
> wrote:
>
>>  Hi,
>>
>>
>>
>> It seems you have a very large window (24 hours), so the increase in
>> memory is expected: the window operation caches the RDDs within the
>> window in memory. For your requirement, there must be enough memory to
>> hold 24 hours of data.
>>
>>
>>
>> I don’t think checkpointing in Spark Streaming can alleviate this problem,
>> because checkpoints are mainly for fault tolerance.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* balu.naren [mailto:balu.naren@gmail.com]
>> *Sent:* Tuesday, January 20, 2015 7:17 PM
>> *To:* user@spark.apache.org
>> *Subject:* spark streaming with checkpoint
>>
>>
>>
>> I am a beginner with Spark Streaming, so I have a basic question regarding
>> checkpoints. My use case is to calculate the number of unique users per day.
>> I am using reduce by key and window for this, with a window duration of 24
>> hours and a slide duration of 5 minutes. I write the processed records to
>> MongoDB, currently replacing the existing record each time. But I see
>> memory slowly increasing over time, and the process is killed after about
>> 1.5 hours (on an AWS small instance). The DB write after the restart clears
>> all the old data. So I understand checkpointing is the solution for this. My
>> questions are:
>>
>>    - What should my checkpoint interval be? The documentation
>>    recommends 5-10 times the slide duration, but I need the data for the
>>    entire day. So is it OK to set it to 24 hours?
>>    - Where should the checkpoint ideally be placed: when I first receive
>>    the stream, just before the window operation, or after the data reduction
>>    has taken place?
>>
>>
>> Appreciate your help.
>> Thank you
>>  ------------------------------
>>
>> View this message in context: spark streaming with checkpoint
>> <http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
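
The figures in the original question can be worked out directly. This is a small sketch that only does the arithmetic, assuming the 5-10x guideline quoted from the documentation is applied to the 5-minute slide duration; the variable names are made up for the example.

```python
from datetime import timedelta

window = timedelta(hours=24)    # window duration from the question
slide = timedelta(minutes=5)    # slide duration from the question

# Number of slide intervals covered by one window.
slides_per_window = window // slide

# Checkpoint interval range under the "5-10 times the slide duration" guideline.
checkpoint_low = 5 * slide
checkpoint_high = 10 * slide

print(slides_per_window)                 # 288
print(checkpoint_low, checkpoint_high)   # 0:25:00 0:50:00
```

Under that guideline the checkpoint interval would be 25 to 50 minutes, which is independent of the 24-hour window length.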
