spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Why do we ever run out of memory in Spark Structured Streaming?
Date Wed, 05 Apr 2017 00:34:44 GMT
Are you referring to the memory usage of stateful operations like
aggregations, or the new mapGroupsWithState?
The current implementation of the internal state store (which maintains the
stateful aggregates) keeps all of the state in the executors' memory. It
uses an HDFS-compatible file system for checkpointing, but the working
state itself still lives entirely in executor memory. This is something we
will improve in the future.
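To make the discussion concrete, here is a rough sketch of a stateful streaming aggregation whose running counts live in that in-memory state store. The socket source, host/port, and checkpoint path are illustrative, not from the original message:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StatefulAggSketch").getOrCreate()
import spark.implicits._

// Hypothetical unbounded source of text lines.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Each distinct value keeps an entry in the state store; without
// watermarking, this state grows without bound as new keys arrive.
val counts = events.groupBy($"value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/agg") // illustrative path
  .start()
```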

That said, you can enable watermarking in your query, which automatically
clears old, unnecessary state and thus limits the total memory used for
stateful operations.
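A rough sketch of what that looks like, assuming a streaming DataFrame `events` with an event-time timestamp column `eventTime` and a `word` column (both names are illustrative):

```scala
import org.apache.spark.sql.functions._

// State for windows older than (max observed eventTime - 10 minutes)
// can be dropped by the engine instead of being kept forever.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"word")
  .count()
```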
Furthermore, you can monitor the size of the state and get alerted if it
grows too large.
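One way to do that monitoring is a `StreamingQueryListener` that inspects the per-trigger progress metrics. The threshold and the `println` alert below are placeholders for whatever alerting you actually use:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    // Total rows held across all stateful operators in this trigger.
    val stateRows = e.progress.stateOperators.map(_.numRowsTotal).sum
    if (stateRows > 10000000L) { // illustrative threshold
      println(s"WARNING: state has grown to $stateRows rows")
    }
  }
})
```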

Read more in the programming guide.
Watermarking -
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Monitoring -
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries

In case you were referring to something else, please give us more context -
what query are you running, and what symptoms are you observing?

On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth909@gmail.com> wrote:

> Why do we ever run out of memory in Spark Structured Streaming, especially
> when memory can always spill to disk? Until the disk is full we shouldn't
> be out of memory, isn't it? Sure, thrashing will happen more frequently and
> degrade performance, but why do we ever run out of memory, even in the case
> of maintaining state for 6 months or a year?
>
> Thanks!
>
