spark-user mailing list archives

From kant kodali <>
Subject Re: Why do we ever run out of memory in Spark Structured Streaming?
Date Wed, 05 Apr 2017 15:49:34 GMT
Actually, I want to reset my counters every 24 hours, so shouldn't the
window and slide interval both be 24 hours? If so, how do I send updates to a
real-time dashboard every second? Isn't the trigger interval the same as the
slide interval?
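
The distinction being asked about can be sketched in plain Python. This is a simplified model, not Spark code: it assumes the window length (how events are bucketed) and the trigger interval (how often totals are emitted) are two independent settings, which is how Structured Streaming treats them (the trigger is set on the query writer, separately from the window used in the aggregation). The names `window_start` and `run` and the sample events are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60   # tumbling window: one bucket per day
TRIGGER_SECONDS = 1             # how often the running totals are emitted


def window_start(event_time: int) -> int:
    """Return the start (epoch seconds) of the 24-hour bucket holding event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def run(events, first_trigger: int, last_trigger: int):
    """Replay (event_time, amount) pairs and emit the current day's total at every trigger.

    `events` must be sorted by event_time. Returns a list of
    (trigger_time, window_start, running_total) tuples.
    """
    totals = defaultdict(int)   # window start -> running sum
    emitted = []
    i = 0
    for now in range(first_trigger, last_trigger + 1, TRIGGER_SECONDS):
        # Fold in every event that has arrived up to this trigger.
        while i < len(events) and events[i][0] <= now:
            event_time, amount = events[i]
            totals[window_start(event_time)] += amount
            i += 1
        current = window_start(now)
        emitted.append((now, current, totals[current]))
    return emitted


# Two events just before a day boundary, two just after it.
events = [(86398, 10), (86399, 5), (86400, 7), (86401, 1)]
results = run(events, first_trigger=86398, last_trigger=86401)
for row in results:
    print(row)
```

In this model the total is reported every second, and it starts again from zero once the trigger time crosses into the next 24-hour bucket, without the window itself sliding every second.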

On Wed, Apr 5, 2017 at 7:17 AM, kant kodali <> wrote:

> One of our requirements is that we need to maintain counters over a 24-hour
> period, such as the number of transactions processed in the past 24 hours. After
> each day these counters can start from zero again, so we just need to
> maintain a running count during the 24-hour period. Also, since we want to
> show these stats on a real-time dashboard, we want those charts to be
> updated every second, so I guess this would translate to a window interval of
> 24 hours and a slide/trigger interval of 1 second. First of all, is this
> okay?
> Secondly, we push about 5000 JSON messages/sec into Spark Streaming, and
> each message is about 2KB. We just need to parse those messages and compute,
> say, the sum of certain fields from each message. The result needs to be
> stored somewhere so that each run takes the previous result and adds its own
> to it. This state has to be maintained for 24 hours, and
> then we can reset it back to zero. Any advice on how best to approach
> this scenario?
> Thanks much!
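
For scale, the figures quoted above can be turned into rough arithmetic. This is a back-of-the-envelope sketch that assumes a constant rate and takes "about 2KB" as 2000 bytes; both are simplifications, and the variable names are illustrative.

```python
MESSAGES_PER_SECOND = 5000
MESSAGE_SIZE_BYTES = 2 * 1000        # "about 2KB", taken as 2000 bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

bytes_per_second = MESSAGES_PER_SECOND * MESSAGE_SIZE_BYTES
messages_per_day = MESSAGES_PER_SECOND * SECONDS_PER_DAY
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"ingest rate:      {bytes_per_second / 1e6:.0f} MB/s")
print(f"messages per day: {messages_per_day:,}")
print(f"raw input/day:    {bytes_per_day / 1e9:.0f} GB")
```

Under these assumptions the raw input is large, but a running sum only needs the accumulated value for each counter, not the messages themselves, so the state kept for the 24-hour period can in principle stay small.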
> On Wed, Apr 5, 2017 at 12:39 AM, kant kodali <> wrote:
>> Hi!
>> I am talking about "stateful operations like aggregations". Does this
>> happen on heap or off heap by default? I came across an article where I saw
>> that both on and off heap are possible, but I am not sure what happens by default
>> and when Spark or Spark Structured Streaming decides to store off heap.
>> I don't even know what mapGroupsWithState does, since it's not part of
>> Spark 2.1, which is what we currently use. Any pointers would be great.
>> Thanks!
>> On Tue, Apr 4, 2017 at 5:34 PM, Tathagata Das <> wrote:
>>> Are you referring to the memory usage of stateful operations like
>>> aggregations, or the new mapGroupsWithState?
>>> The current implementation of the internal state store (which maintains
>>> the stateful aggregates) keeps all the data in the memory of
>>> the executor. It does use an HDFS-compatible file system for checkpointing,
>>> but as of now it keeps all the data in the executor's memory.
>>> This is something we will improve in the future.
>>> That said, you can enable watermarking in your query, which will
>>> automatically clear old, unnecessary state and thus limit the total memory
>>> used for stateful operations.
>>> Furthermore, you can also monitor the size of the state and get alerts
>>> if the state is growing too large.
>>> Read more in the programming guide.
>>> Watermarking -
>>> /latest/structured-streaming-programming-guide.html#handling
>>> -late-data-and-watermarking
>>> Monitoring -
>>> rogramming-guide.html#monitoring-streaming-queries
>>> In case you were referring to something else, please give us more
>>> context: which query, and what symptoms you are observing.
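
The watermarking idea described above, where old state is cleared once it can no longer be updated, can be illustrated with a small plain-Python model. This is a simplified sketch and not Spark's state store: it assumes the watermark is the largest event time seen so far minus a fixed delay, that events older than the watermark are ignored, and that a window's state is dropped once the window ends at or before the watermark. The class name, constants, and sample timestamps are hypothetical.

```python
WINDOW_SECONDS = 600      # 10-minute tumbling windows (illustrative)
WATERMARK_DELAY = 300     # tolerate events up to 5 minutes late (illustrative)


class WindowedCounts:
    """In-memory per-window counts with watermark-based eviction."""

    def __init__(self):
        self.state = {}           # window start -> event count
        self.max_event_time = 0   # largest event time seen so far

    def process_batch(self, event_times):
        """Apply one batch of event times; return the window starts evicted."""
        # Watermark carried over from earlier batches.
        watermark = self.max_event_time - WATERMARK_DELAY
        for t in event_times:
            if t < watermark:
                continue          # too late: its window may already be gone
            start = t - (t % WINDOW_SECONDS)
            self.state[start] = self.state.get(start, 0) + 1
        if event_times:
            self.max_event_time = max(self.max_event_time, max(event_times))
        # Advance the watermark and drop windows that can no longer change.
        new_watermark = self.max_event_time - WATERMARK_DELAY
        expired = sorted(s for s in self.state
                         if s + WINDOW_SECONDS <= new_watermark)
        for s in expired:
            del self.state[s]
        return expired


counts = WindowedCounts()
evicted_1 = counts.process_batch([10, 20, 610])   # nothing old enough to evict yet
evicted_2 = counts.process_batch([1000])          # watermark moves to 700
evicted_3 = counts.process_batch([50])            # late event, ignored
print(evicted_1, evicted_2, evicted_3, counts.state)
```

The point of the model is that memory use is bounded by the number of windows still open behind the watermark, rather than by how long the query has been running.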
>>> On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <> wrote:
>>>> Why do we ever run out of memory in Spark Structured Streaming,
>>>> especially when memory can always spill to disk? Until the disk is full,
>>>> we shouldn't be out of memory, should we? Sure, thrashing will happen more
>>>> frequently and degrade performance, but why would we ever run out of memory,
>>>> even when maintaining state for 6 months or a year?
>>>> Thanks!
