spark-user mailing list archives

From sfbayeng <sfbay...@yahoo.com>
Subject [SPARK-SQL] Spark Persist slower than non-persist call.
Date Thu, 28 Sep 2017 17:06:14 GMT
    My setup: Spark 2.1 on a 3-node YARN cluster with 160 GB, dynamic
    allocation turned on, spark.executor.memory=6G, spark.executor.cores=6.
   
    First, I read two Hive tables, orders (329 MB) and lineitems (1.43 GB),
    and do a left outer join. Next, I apply 7 different filter conditions to
    the joined dataset (something like
    var line1=joinedDf.filter("l_linenumber=1"),
    var line2=joinedDf.filter("l_linenumber=2"), etc.). Because I filter the
    joined dataset multiple times, I thought a persist(MEMORY_ONLY) should
    help here, as the joined dataset will fit fully in memory.
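
A minimal Scala sketch of the setup described above, for readers who want to reproduce it. The join key columns (o_orderkey, l_orderkey) and the object and function names are assumptions for illustration; only l_linenumber, the left outer join, and persist(MEMORY_ONLY) come from the post.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Joins orders to lineitems, marks the result for caching, and derives
  // one filtered DataFrame per line number (1 through 7).
  // ASSUMPTION: the join keys are named o_orderkey and l_orderkey.
  def joinAndFilter(orders: DataFrame, lineitems: DataFrame): Seq[DataFrame] = {
    val joinedDf = orders.join(
      lineitems,
      orders("o_orderkey") === lineitems("l_orderkey"),
      "left_outer")

    // persist() only marks the dataset for caching; nothing is computed
    // until an action runs on it or on one of its descendants.
    joinedDf.persist(StorageLevel.MEMORY_ONLY)

    (1 to 7).map(n => joinedDf.filter(s"l_linenumber = $n"))
  }
}
```

With a SparkSession named spark, this could be called as PersistSketch.joinAndFilter(spark.table("orders"), spark.table("lineitems")), assuming the Hive tables are registered under those names.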
   
    1. With persist, the Spark job takes longer to run than without it
    (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single stage
    is created for the persist, and the other downstream jobs wait for the
    persist to complete. Does that mean persist is a blocking call? Or do
    stages in other jobs start processing as and when persisted blocks
    become available?
    2. In the non-persist case, different jobs create different stages to
    read the same data. The data is read multiple times in different stages,
    but this still turns out to be faster than the persist case.
    3. With larger data sets, persist actually causes executors to run out
    of memory (Java heap space). Without persist, the Spark jobs complete
    just fine. I looked at some other suggestions here: "Spark
    java.lang.OutOfMemoryError: Java heap space". I tried increasing and
    decreasing executor cores, persisting with disk only, increasing
    partitions, and modifying the storage ratio, but nothing seems to help
    with the executor memory issues.
    4. I also posted this on Stack Overflow and tried the suggestions there
    (persisting the joined dataset and doing a count before applying the
    filter conditions), but that did not improve persist performance:
    https://stackoverflow.com/questions/46101585/persist-slower-than-non-persist-calls
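
The "persist, then count before filtering" suggestion from item 4 can be sketched as a small helper. This is an illustration only: the object and function names are hypothetical, and the default storage level simply mirrors the MEMORY_ONLY choice mentioned earlier in the post.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object CacheWarmup {
  // Marks df for caching and immediately runs an action so the cached
  // blocks are computed once, before any downstream filter jobs start.
  // The helper name and default level are assumptions for illustration.
  def persistAndMaterialize(
      df: DataFrame,
      level: StorageLevel = StorageLevel.MEMORY_ONLY): DataFrame = {
    val cached = df.persist(level)
    cached.count() // action: forces evaluation of every partition
    cached
  }
}
```

Passing StorageLevel.DISK_ONLY as the second argument would correspond to the disk-only attempt mentioned in item 3.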
   
    I would appreciate it if someone could explain how persist works, in
    what cases it is faster than not persisting and, more importantly, how
    to go about troubleshooting out-of-memory issues.
   



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/


