spark-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Logistic Regression Iterations causing High GC in Spark 2.3
Date Mon, 29 Jul 2019 14:12:04 GMT
Could be lots of things. Implementations change, caching may have
changed, etc. The size of the input doesn't really directly translate
to heap usage. Here you just need a bit more memory.
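For illustration, here is a minimal sketch of what "a bit more memory" could look like against the executor settings quoted further down; the values are assumptions, not figures recommended in this thread:

    # Illustrative only: raise the executor heap. spark.executor.memory sets the
    # executor JVM's maximum heap, so a pinned -Xms should be raised in step.
    spark.executor.memory=12g
    spark.executor.extraJavaOptions=-XX:+UseG1GC -Xms12g -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5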

On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati <dhruba.work@gmail.com> wrote:
>
> Hi Sean,
>
> Yeah, I checked the heap; it's almost full. I checked the GC logs in the executors
> and found that GC cycles are kicking in frequently. The Executors tab shows red in
> the "Total Time/GC Time" column.
>
> Also, the data I am dealing with is quite small (~4 GB) and the cluster is quite
> big for such high GC.
>
> But what's troubling me is that this issue doesn't occur in Spark 2.2 at all. What
> could be the reason behind such behaviour?
>
> Regards,
> Dhrub
>
> On Mon, Jul 29, 2019 at 6:45 PM Sean Owen <srowen@gmail.com> wrote:
>>
>> -dev@
>>
>> Yep, high GC activity means '(almost) out of memory'. I don't see that
>> you've checked heap usage - is it nearly full?
>> The answer isn't tuning but more heap.
>> (Sometimes with really big heaps the problem is big pauses, but that's
>> not the case here.)
>>
>> On Mon, Jul 29, 2019 at 1:26 AM Dhrubajyoti Hati <dhruba.work@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > We were running Logistic Regression in Spark 2.2.x and then tried to see how it
>> > does in Spark 2.3.x. Now we are facing an issue while running a Logistic Regression
>> > model in Spark 2.3.x on top of YARN (GCP Dataproc). The treeAggregate step takes a
>> > huge amount of time due to very high GC activity. I have tuned the GC, created
>> > different-sized clusters, tried a higher Spark version (2.4.x) and smaller data, but
>> > nothing helps. The GC time is 100-1000 times the processing time on average per
>> > iteration. (A sketch of this kind of training run appears below this message.)
>> >
>> > The strange part is that in Spark 2.2 this doesn't happen at all. Same code, same
>> > cluster sizing, same data in both cases.
>> >
>> > I was wondering if someone can explain this behaviour and help me resolve it. How
>> > can the same code behave so differently in two Spark versions, especially the
>> > higher ones?
>> >
>> > Here is the config I used:
>> >
>> > spark.serializer=org.apache.spark.serializer.KryoSerializer
>> >
>> > # GC tuning
>> > spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms9000m -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> >
>> > spark.executor.instances=20
>> > spark.executor.cores=1
>> > spark.executor.memory=9010m
>> >
>> > Regards,
>> > Dhrub
>> >
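For context, a minimal Scala sketch of the kind of training run described above, assuming spark.ml's LogisticRegression on a DataFrame of "label"/"features" columns; the input path, iteration count and regularization are illustrative, since the thread does not show the actual code. Each iteration aggregates gradient and loss contributions across partitions with treeAggregate, which is where the reported GC time accumulates.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lr-gc-repro").getOrCreate()

    // Hypothetical input; the real dataset (~4 GB) is not shown in the thread.
    val training = spark.read.format("libsvm").load("data/training_libsvm.txt")

    val lr = new LogisticRegression()
      .setMaxIter(100)     // each iteration runs a treeAggregate pass over the data
      .setRegParam(0.01)

    val model = lr.fit(training)   // the high GC time is reported during these iterations
    println(s"Trained model with ${model.coefficients.size} coefficients")

    spark.stop()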

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

