spark-user mailing list archives

From Jörn Franke <>
Subject Re: Logistic Regression Iterations causing High GC in Spark 2.3
Date Mon, 29 Jul 2019 07:07:10 GMT
I would remove all the GC tuning and add it back later, once you have found the underlying root cause.
Usually more GC means you need to provide more memory, because something has changed (your
application, Spark version, etc.).
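
As a rough sketch of what such a baseline could look like, here is the configuration quoted
below with the collector and sizing flags dropped. Keeping the plain GC logging flags is an
assumption on my part, made only so that GC pauses stay visible in the executor logs; it is
not a recommended final setup.

```properties
spark.serializer=org.apache.spark.serializer.KryoSerializer
# GC tuning removed: no -XX:+UseG1GC, -Xms, ParallelGCThreads or ConcGCThreads.
# Assumption: GC logging only is kept, so the JVM default collector is observed as-is.
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.instances=20
spark.executor.cores=1
spark.executor.memory=9010m
```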

We don’t have your full code, so we can’t give exact advice, but you may want to rethink the
one core / executor approach and have fewer executors with more cores / executor. The one-core
layout can sometimes lead to more heap usage (especially if you broadcast). Keep in mind that
more cores / executor usually also requires more memory per executor, though you need fewer
executors. Similarly, you may have too many executor instances, each without enough heap.
You can also increase the executor memory.
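
To make the fewer-but-larger-executors idea concrete, here is one possible layout that keeps
the same 20 cores and the same total heap as the quoted configuration (20 x 9010m). The
numbers are illustrative assumptions only: I don't know your node sizes, and YARN adds a
memory overhead on top of each executor's heap, so the containers must still fit on a node.

```properties
# Hypothetical layout: 5 executors x 4 cores instead of 20 executors x 1 core.
spark.executor.instances=5
spark.executor.cores=4
# 4 x 9010m, so the heap available per core is unchanged (assumed starting point;
# raise it if GC pressure remains).
spark.executor.memory=36040m
```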

> On 29.07.2019 at 08:22, Dhrubajyoti Hati <> wrote:
> Hi,
> We were running Logistic Regression in Spark 2.2.X and then tried to see how it does
> in Spark 2.3.X. Now we are facing an issue while running a Logistic Regression model
> in Spark 2.3.X on top of YARN (GCP Dataproc). The TreeAggregate method takes a huge amount
> of time due to very high GC activity. I have tuned the GC, tried differently sized clusters,
> a higher Spark version (2.4.X), and smaller data, but nothing helps. On average, the GC time
> is 100 to 1000 times the processing time per iteration.
> The strange part is that in Spark 2.2 this doesn't happen at all. Same code, same cluster
> sizing, same data in both cases.
> I was wondering if someone can explain this behaviour and help me resolve it. How
> can the same code behave so differently in two Spark versions, especially the higher ones?
> Here is the config I used:
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> #GC Tuning
> spark.executor.extraJavaOptions= -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms9000m -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
> spark.executor.instances=20
> spark.executor.cores=1
> spark.executor.memory=9010m
> Regards,
> Dhrub
