spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-19093) Cached tables are not used in SubqueryExpression
Date Sun, 08 Jan 2017 22:10:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-19093.
---------------------------------------
       Resolution: Fixed
         Assignee: Dilip Biswal
    Fix Version/s: 2.2.0

> Cached tables are not used in SubqueryExpression
> ------------------------------------------------
>
>                 Key: SPARK-19093
>                 URL: https://issues.apache.org/jira/browse/SPARK-19093
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Josh Rosen
>            Assignee: Dilip Biswal
>             Fix For: 2.2.0
>
>
> See reproduction at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> In both plans I'd expect both sides of the join to read from the cached table, for the cross join as well as the anti join. However, the left anti join produces the following plan, which reads only the left side from the cache and reads the right side via a regular, non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>    :  +- 'Project [*]
>    :     +- 'UnresolvedRelation `rows`
>    +- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>    :  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>    :     +- Project [_1#3775, _2#3776]
>    :        +- SubqueryAlias rows
>    :           +- Relation[_1#3775,_2#3776] parquet
>    +- SubqueryAlias rows
>       +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
> :     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>    +- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> :     +- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
> :           +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>    +- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>       +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}
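> For reference, a quick (untested) sketch of how one might confirm this from the repro above: count the InMemoryRelation leaves that survive optimization. In the anti-join plan shown above only the left input of the join is cached, so this would report 1 instead of the expected 2. The import path assumes Spark 2.x, where InMemoryRelation lives in org.apache.spark.sql.execution.columnar.
> {code}
> // Sketch only: count cached-relation leaves in the optimized plan.
> import org.apache.spark.sql.execution.columnar.InMemoryRelation
>
> val optimized = spark.sql(
>   "select * from rows where not exists (select * from rows)"
> ).queryExecution.optimizedPlan
>
> // The NOT EXISTS subquery is rewritten into a LeftAnti join by the optimizer,
> // so both inputs appear as children of the join; only the left one is cached here.
> val cachedScans = optimized.collect { case r: InMemoryRelation => r }
> println(s"InMemoryRelation occurrences: ${cachedScans.size}")  // expected 2, observed 1
> {code}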



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
