spark-user mailing list archives

From Xin Liu <>
Subject Re: SizeEstimator
Date Tue, 27 Feb 2018 03:24:49 GMT
Thanks, David. Another solution is to convert the protobuf object to a byte
array; that also speeds up SizeEstimator considerably.
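The idea is that SizeEstimator can read a flat `byte[]`'s size in O(1) from the array length, instead of walking every field of a nested object graph. A real protobuf `Message` would be flattened with `message.toByteArray()`; the sketch below uses a hypothetical `Record` class and stdlib Java serialization only so it is self-contained:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.List;

public class ByteArraySizing {
    // Hypothetical stand-in for a protobuf message; a real protobuf
    // Message would be flattened with message.toByteArray() instead.
    static class Record implements Serializable {
        final String key;
        final List<Integer> values;
        Record(String key, List<Integer> values) {
            this.key = key;
            this.values = values;
        }
    }

    // Flatten the object graph into one byte[] before shuffling.
    // SizeEstimator then only sees an array header plus a length,
    // rather than a graph it has to traverse reflectively.
    static byte[] toBytes(Record r) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] flat = toBytes(new Record("k1", List.of(1, 2, 3)));
        // The shuffled record is now a flat byte[] whose size is exact.
        System.out.println(flat.length > 0);
    }
}
```

The trade-off is an extra serialize/deserialize step on each side of the shuffle, paid in exchange for cheap and exact size accounting.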

On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <> wrote:

> This is used to predict the current memory cost so Spark knows whether to
> flush or not. This is very costly for us, so we use a flag marked private in
> the code to lower the cost:
> spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no
> typo) - how many records before a flush is forced.
> This lowers the cost because it lets us leave data in the young generation;
> if we don't bound it, everything gets promoted to the old generation and GC
> becomes an issue. This doesn't solve the fact that the walk is slow, but it
> lowers the cost of GC. We make sure to have spare memory on the system for
> the page cache, so spilling to disk is a memory write 99% of the time for
> us. If your host has less free memory, spilling may become more expensive.
> If the walk is your bottleneck and not GC, then I would recommend JOL plus
> some educated guessing to better predict memory use.
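The flag David mentions is set like any other Spark conf, though it is marked private in the Spark source, so it is undocumented and may change between releases. A minimal sketch (the app name and threshold value are illustrative, and spark-core must be on the classpath):

```java
// Configuration sketch only; requires org.apache.spark:spark-core.
import org.apache.spark.SparkConf;

public class SpillThresholdConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("proto-shuffle-job")  // hypothetical app name
            // Internal (private) flag: force a spill after this many records,
            // bounding how large the in-memory collection can grow before
            // it is flushed. The value below is only an illustration.
            .set("spark.shuffle.spill.numElementsForceSpillThreshold",
                 "5000000");
        // ... build the SparkContext / SparkSession from conf as usual ...
    }
}
```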
> On Mon, Feb 26, 2018, 4:47 PM Xin Liu <> wrote:
>> Hi folks,
>> We have a situation where the shuffled data is protobuf-based, and
>> SizeEstimator is taking a lot of time.
>> We have tried overriding SizeEstimator to return a constant value, which
>> speeds things up a lot.
>> My question: what are the side effects of disabling SizeEstimator? Is it
>> just that Spark's memory accounting becomes inaccurate, or are there more
>> severe consequences?
>> Thanks!
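Following David's suggestion, JOL (Java Object Layout, `org.openjdk.jol:jol-core`) can measure a representative record's retained size once, offline, and that number can then serve as the constant estimate instead of a blind guess. A sketch, assuming jol-core is on the classpath:

```java
// Requires the org.openjdk.jol:jol-core dependency; sketch only.
import org.openjdk.jol.info.GraphLayout;

public class JolSizing {
    public static void main(String[] args) {
        // A representative sample record; substitute a typical
        // protobuf message from your own workload here.
        Object sample = new int[][] { {1, 2, 3}, {4, 5} };

        // Walk the object graph once, up front, to get an accurate
        // per-record size; use that as the fixed estimate rather than
        // letting SizeEstimator sample repeatedly at runtime.
        long bytes = GraphLayout.parseInstance(sample).totalSize();
        System.out.println("approx bytes per record: " + bytes);
    }
}
```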
