spark-user mailing list archives

From sfbayeng <>
Subject [SPARK-SQL] Spark Persist slower than non-persist calls
Date Fri, 01 Sep 2017 21:39:14 GMT
My settings are: Spark 2.1 running on a 3-node YARN cluster with 160 GB.
Dynamic allocation turned on. spark.executor.memory=6G,

First, I am reading the Hive tables orders (329 MB) and lineitems (1.43 GB) and
doing a left outer join.
Next, I apply 7 different filter conditions to the joined dataset (something
like var line1=joinedDf.filter("l_linenumber=1"), var
line2=joinedDf.filter("l_linenumber=2"), etc.). Because I'm filtering the
joined dataset multiple times, I thought a persist(MEMORY_ONLY) should
help here, as the joined dataset will fit fully in memory.
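For reference, the pipeline described above might look roughly like the sketch below. The join keys (o_orderkey, l_orderkey) and the count() actions are assumptions in TPC-H style and are not stated in this message; only the table names, the left outer join, the l_linenumber filters, and persist(MEMORY_ONLY) come from the description.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .enableHiveSupport() // needed to read Hive tables
      .getOrCreate()

    val orders = spark.table("orders")
    val lineitems = spark.table("lineitems")

    // Join keys are assumed; the message does not name them.
    val joinedDf = orders
      .join(lineitems, orders("o_orderkey") === lineitems("l_orderkey"), "left_outer")
      .persist(StorageLevel.MEMORY_ONLY)

    // Two of the seven filters mentioned in the message.
    val line1 = joinedDf.filter("l_linenumber = 1")
    val line2 = joinedDf.filter("l_linenumber = 2")

    // persist() only marks the dataset for caching; nothing is cached
    // until an action runs. count() is an assumed action for illustration.
    println(line1.count())
    println(line2.count())

    joinedDf.unpersist()
    spark.stop()
  }
}
```

Removing the persist(...) and unpersist() calls gives the non-persist variant, in which each filter job reads and joins the source tables again.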

1. I noticed that with persist, the Spark job takes longer to run than
without persist (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single
stage got created for persist and the other downstream jobs are waiting for the
persist to complete. Does that mean persist is a blocking call? Or do stages
in other jobs start processing as and when persisted blocks become available?
2. In the non-persist case, different jobs create different stages to read
the same data. The data is read multiple times in different stages, but this
still turns out to be faster than the persist case.
3. With larger data sets, persist actually causes executors to run out of
memory: Java heap space. Without persist, the Spark jobs complete just fine.
I looked at some other suggestions here: "Spark java.lang.OutOfMemoryError:
Java heap space". I tried increasing/decreasing executor cores, persisting
with disk only, increasing partitions, and modifying the storage ratio, but
nothing seems to help with the executor memory issues.
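The adjustments listed in point 3 can be sketched as follows. This is only an illustration under assumptions: the message gives no values, so every number below is hypothetical, and "storage ratio" is taken to mean spark.memory.storageFraction (it could equally refer to spark.memory.fraction). The join keys are assumed as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object MemoryTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("memory-tuning-sketch")
      .config("spark.executor.memory", "6g")           // from the message
      .config("spark.executor.cores", "2")             // hypothetical value
      .config("spark.memory.storageFraction", "0.3")   // hypothetical value
      .config("spark.sql.shuffle.partitions", "400")   // hypothetical value
      .enableHiveSupport()
      .getOrCreate()

    val orders = spark.table("orders")
    val lineitems = spark.table("lineitems")

    // Join keys are assumed; DISK_ONLY keeps cached blocks off the heap.
    val joinedDf = orders
      .join(lineitems, orders("o_orderkey") === lineitems("l_orderkey"), "left_outer")
      .repartition(400)                                // hypothetical partition count
      .persist(StorageLevel.DISK_ONLY)

    println(joinedDf.filter("l_linenumber = 1").count())

    joinedDf.unpersist()
    spark.stop()
  }
}
```

On YARN, executor memory and cores are usually passed at submit time (for example through spark-submit options) rather than set in code; they appear in the builder here only to keep the sketch in one place.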

I would appreciate it if someone could explain how persist works, in what cases
it is faster than not persisting and, more importantly, how to go about
troubleshooting out-of-memory issues.

