spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chang Chen <baibaic...@gmail.com>
Subject [DISCUSS] Caching SparkPlan
Date Sat, 01 Feb 2020 06:48:30 GMT
I'd like to start a discussion on caching SparkPlan

>From what I benchmark, if sql execution time is less than 1 second, then we
cannot ignore the following overheads , especially if we cache data in
memory

   1. Paring, analysing, optimizing SQL
   2. Generating Physical Plan (SparkPlan)
   3. Generating Codes
   4. Generating RDD

snappydata(https://github.com/SnappyDataInc/snappydata) already implemented
plan cache for same query pattern, for example ,given the sql: SELECT
intField, count(*), sum(intField) FROM t WHERE intField = 1 group by
intField, for any sql which is only difference on Literal, we can re-use
the same SparkPlan. snappydata's implementation is based on caching the
final RDD graph, since spark will cache shuffle data internally, I believe
it can not run the same  query pattern concurrently.

My idea is caching SparkPlan(and hence, caching generated codes), and
creating RDD graph every time(the only overhead we can not ignore, but if
we read data from memory, the overhead is OK). I did some experiment by
extracting ParamLiteral from snappydata, and the experiment looks fine.

I want to get some comments before writing a jira.  Would appreciate
comments and discussions from the community.

Thanks
Chang

Mime
View raw message