spark-user mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: Why do we ever run out of memory in Spark Structured Streaming?
Date Wed, 05 Apr 2017 14:17:07 GMT
One of our requirements is that we need to maintain a counter over a 24-hour
period, such as the number of transactions processed in the past 24 hours.
After each day these counters can start from zero again, so we just need to
maintain a running count during the 24-hour period. Also, since we want to
show these stats on a real-time dashboard, we want the charts to be updated
every second, so I guess this would translate to a window interval of
24 hours and a slide/trigger interval of 1 second. First of all, is this
okay?
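To make the intended semantics concrete, here is a minimal plain-Python
sketch (not the Spark API; all names here are illustrative) of a counter
over 24-hour buckets that the dashboard could poll every second:

```python
from datetime import datetime, timezone

# Sketch of the intended semantics: events fall into 24-hour tumbling
# buckets keyed by UTC date; the dashboard reads the current bucket's
# count on every 1-second trigger. A new day starts a fresh bucket, so
# the counter "resets" to zero automatically.
counts = {}  # day bucket -> running count

def record(event_time: datetime) -> None:
    bucket = event_time.date()        # start of the 24-hour window
    counts[bucket] = counts.get(bucket, 0) + 1

def current_count(now: datetime) -> int:
    # a day with no events yet reads as zero
    return counts.get(now.date(), 0)

t1 = datetime(2017, 4, 5, 23, 59, tzinfo=timezone.utc)
t2 = datetime(2017, 4, 6, 0, 1, tzinfo=timezone.utc)
record(t1)
record(t1)
record(t2)
```

Note this models a tumbling 24-hour window (reset at a fixed boundary),
which matches "start from zero after each day", rather than a sliding
24-hour window, which would need per-event expiry.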

Secondly, we push about 5000 JSON messages/sec into Spark Streaming, and
each message is about 2KB. We just need to parse those messages and compute,
say, the sum of certain fields from each message. The result needs to be
stored somewhere such that each run takes its result and adds it to the
previous run's, and this state has to be maintained for 24 hours, after
which we can reset it back to zero. Any advice on how best to approach
this scenario?
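As a plain-Python sketch of that accumulation step (hypothetical names,
and assuming the summed field is called "amount"): each micro-batch parses
its JSON messages, sums the field, and folds the result into the total
carried over from the previous run.

```python
import json

# Hypothetical micro-batch accumulator: each trigger parses a batch of
# JSON messages, sums the field of interest, and adds the batch sum to
# the running total carried over from the previous run.
def process_batch(messages, running_total, field="amount"):
    batch_sum = sum(json.loads(m).get(field, 0) for m in messages)
    return running_total + batch_sum

total = 0
total = process_batch(['{"amount": 10}', '{"amount": 5}'], total)
total = process_batch(['{"amount": 7}'], total)
# at the end of the 24-hour period the total is reset:
total_after_reset = 0
```

In Spark this running total would live in the internal state store rather
than a local variable, but the fold-per-trigger shape is the same.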

Thanks much!
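As an aside, the watermarking idea Tathagata describes in the reply below
can be illustrated with a small sketch (this is not Spark's actual
implementation; the function and variable names are made up): state keyed
by window-end time is dropped once the watermark, i.e. the maximum event
time seen minus an allowed lateness, passes the window's end.

```python
from datetime import datetime, timedelta, timezone

# Illustrative watermark-based eviction: keep only windows whose end
# time is still ahead of the watermark; everything older is dropped,
# which bounds the memory held for stateful aggregations.
def evict(state, max_event_time, delay):
    watermark = max_event_time - delay
    return {end: v for end, v in state.items() if end > watermark}

day = timedelta(hours=24)
w1_end = datetime(2017, 4, 5, tzinfo=timezone.utc)
w2_end = w1_end + day
state = {w1_end: 100, w2_end: 42}
# max event time is 2 hours past w1_end; with 1 hour allowed lateness,
# the watermark passes w1_end, so that window's state is evicted.
state = evict(state, w1_end + timedelta(hours=2), timedelta(hours=1))
```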

On Wed, Apr 5, 2017 at 12:39 AM, kant kodali <kanth909@gmail.com> wrote:

> Hi!
>
> I am talking about "stateful operations like aggregations". Does this
> happen on heap or off heap by default? I came across an article where I saw
> both on-heap and off-heap are possible, but I am not sure what happens by
> default and when Spark or Spark Structured Streaming decides to store state
> off heap.
>
> I don't even know what mapGroupsWithState does, since it's not part of
> Spark 2.1, which is what we currently use. Any pointers would be great.
>
> Thanks!
>
>
>
>
>
> On Tue, Apr 4, 2017 at 5:34 PM, Tathagata Das <tathagata.das1565@gmail.com
> > wrote:
>
>> Are you referring to the memory usage of stateful operations like
>> aggregations, or the new mapGroupsWithState?
>> The current implementation of the internal state store (that maintains
>> the stateful aggregates) keeps all the data in the memory of the
>> executors. It does use an HDFS-compatible file system for checkpointing,
>> but as of now the working state itself stays in executor memory. This is
>> something we will improve in the future.
>>
>> That said, you can enable watermarking in your query, which would
>> automatically clear old, unnecessary state, thus limiting the total memory
>> used for stateful operations.
>> Furthermore, you can also monitor the size of the state and get alerts if
>> the state is growing too large.
>>
>> Read more in the programming guide.
>> Watermarking - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
>> Monitoring - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>>
>> In case you were referring to something else, please give us more context
>> and details - what is your query, and what symptoms are you observing?
>>
>> On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth909@gmail.com> wrote:
>>
>>> Why do we ever run out of memory in Spark Structured Streaming,
>>> especially when memory can always spill to disk? Until the disk is full
>>> we shouldn't be out of memory, should we? Sure, thrashing will happen
>>> more frequently and degrade performance, but do we ever run out of
>>> memory even when maintaining state for 6 months or a year?
>>>
>>> Thanks!
>>>
>>
>>
>
