spark-user mailing list archives

From Kalin Stoyanov <kgs.v...@gmail.com>
Subject Re: full SQL query graph not shown in monitoring when using cache
Date Thu, 15 Apr 2021 13:37:39 GMT
Hi Mohamadreza,

This is not happening for me with the example I showed - there is just this
one SQL query logged, and everything before the InMemoryTableScan from the
physical plan is not present in the graph above it. Here's the log itself
if you want to see it.

Regards,
Kalin

On Thu, Apr 15, 2021 at 4:26 PM Mohamadreza Rostami <
mohamadrezarostami2@gmail.com> wrote:

> Hi
> When you cache a DataFrame, the first time you call an action on it, such
> as a SQL query, you can see all of the transformations being run. On later
> action calls, the results of those transformations are read from the cache,
> and Spark runs only the transformations written after the cache. This is
> what caching means in Spark.
>
> On Farvardin 26, 1400 AP, at 17:24, Kalin Stoyanov <kgs.void@gmail.com>
> wrote:
>
> Hi all,
>
> I noticed something a bit strange. When working with a cached DF, the SQL
> query details graph starts from the point where the cache takes place, and
> doesn't show the transformations before it. For example, this code
>
> >>> df = sc.parallelize([[1,2,3],[1,4,5]]).toDF(['id','a','b'])
> >>> renameCols = [f"`{col}` as `{col}_other`" for col in df.columns]
> >>> df_cart = df.crossJoin(df.selectExpr(renameCols))
> >>> df = df_cart.groupBy("id").sum("a", "b")
> >>> df = df.cache()
> >>> df = df.selectExpr("id", "`sum(a)` * 2 as a", "`sum(b)` * 2 as b")
> >>> df.show()
>
> produces 1 query with this physical plan
>
> == Physical Plan ==
> CollectLimit 21
> +- *(1) Project [cast(id#0L as string) AS id#58, cast((sum(a)#26L * 2) as string) AS a#59, cast((sum(b)#27L * 2) as string) AS b#60]
>    +- *(1) ColumnarToRow
>       +- InMemoryTableScan [id#0L, sum(a)#26L, sum(b)#27L]
>             +- InMemoryRelation [id#0L, sum(a)#26L, sum(b)#27L], StorageLevel(disk, memory, deserialized, 1 replicas)
>                   +- *(4) HashAggregate(keys=[id#0L], functions=[sum(a#1L), sum(b#2L)], output=[id#0L, sum(a)#26L, sum(b)#27L])
>                      +- Exchange hashpartitioning(id#0L, 4), true, [id=#30]
>                         +- *(3) HashAggregate(keys=[id#0L], functions=[partial_sum(a#1L), partial_sum(b#2L)], output=[id#0L, sum#33L, sum#34L])
>                            +- CartesianProduct
>                               :- *(1) Scan ExistingRDD[id#0L,a#1L,b#2L]
>                               +- *(2) Project
>                                  +- *(2) Scan ExistingRDD[id#0L,a#1L,b#2L]
>
> But the visual graph representation is just
>
> <image.png>
> Is this something that's done on purpose? I'd rather see the whole
> thing... This is on Spark 3.0.1.
>
>
>
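
The caching behaviour described in the quoted reply (everything runs on the first action, and later actions read from the cache) can be sketched with a small toy model in plain Python. This is only an illustration of lazy evaluation with a cache, not Spark's implementation, and it does not model the monitoring UI. The `LazyFrame` class, its methods, and the `executed` log are hypothetical names introduced here.

```python
# Toy model of lazy evaluation with caching. NOT Spark's implementation:
# LazyFrame, map, cache, collect and `executed` are hypothetical names.

executed = []  # steps that actually ran, in execution order


class LazyFrame:
    """Holds a recipe for producing rows; nothing runs until collect()."""

    def __init__(self, name, compute):
        self.name = name
        self._compute = compute    # zero-argument function returning rows
        self._use_cache = False
        self._cached = None

    def map(self, name, fn):
        # Transformation: build a new lazy node on top of this one.
        parent = self
        return LazyFrame(name, lambda: [fn(row) for row in parent._rows()])

    def cache(self):
        # Lazy as well: only marks the node, nothing is computed here.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            executed.append(f"read cache: {self.name}")
            return self._cached
        rows = self._compute()     # runs the upstream steps first
        executed.append(f"run {self.name}")
        if self._use_cache:
            self._cached = rows    # filled by the first action only
        return rows

    def collect(self):
        # Action: triggers evaluation of the whole chain.
        return list(self._rows())


source = LazyFrame("source", lambda: [(1, 2, 3), (1, 4, 5)])
summed = source.map("sum columns", lambda r: (r[0], r[1] + r[2])).cache()
doubled = summed.map("double", lambda r: (r[0], r[1] * 2))

first = doubled.collect()      # first action: every step runs
first_log = list(executed)

executed.clear()
second = doubled.collect()     # second action: upstream steps are skipped
second_log = list(executed)

print(first_log)
print(second_log)
```

In this model the first action logs all three steps, while the second logs only the cache read and the step written after `cache()`. The case reported in this thread is the first action, where the steps before the cache do run, so the model explains the execution behaviour but not why the graph omits them.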
