spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ilya Peysakhov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4
Date Tue, 26 Feb 2019 16:44:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ilya Peysakhov updated SPARK-26996:
-----------------------------------
    Description: 
Spark 2.4 reports an error when querying a dataframe that has only 1 row and 1 column (scalar
subquery). 

 

Reproducer is below. No other data is needed to reproduce the error.

This will write a table of dates and strings, write another "fact" table of ints and dates,
then read both tables as views and filter the "fact" based on the max(date) from the first
table. This is done within spark-shell in spark 2.4 vanilla (also reproduced in AWS EMR 5.20.0)

-------------------------

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL select '2018-01-02',
'source2' UNION ALL select '2018-01-03' , 'source3' UNION ALL select '2018-01-04' ,'source4'
").write.mode("overwrite").save("/latest_dates")
 val mydatetable = spark.read.load("/latest_dates")
 mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, '2018-01-02' UNION
ALL select 300, '2018-01-03' UNION ALL select 3444, '2018-01-01' UNION ALL select 600, '2018-08-30'
").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
 val source1 = spark.read.load("/mypartitioneddata")
 source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= (select latest_date
from latest_dates where source='source1') ").show

 ----------------------------

 

Error summary

—

java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#35 []
 at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
 at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)

-------

This reproducer works in previous versions (2.3.2, 2.3.1, etc).

 

  was:
Spark 2.4 reports an error when querying a dataframe that has only 1 row (scalar subquery). 

 

Reproducer is below. No other data is needed to reproduce the error.

This will write a table of dates and strings, write another "fact" table of ints and dates,
then read both tables as views and filter the "fact" based on the max(date) from the first
table. This is done within spark-shell in spark 2.4 vanilla (also reproduced in AWS EMR 5.20.0)

-------------------------

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL select '2018-01-02',
'source2' UNION ALL select '2018-01-03' , 'source3' UNION ALL select '2018-01-04' ,'source4'
").write.mode("overwrite").save("/latest_dates")
val mydatetable = spark.read.load("/latest_dates")
mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, '2018-01-02' UNION
ALL select 300, '2018-01-03' UNION ALL select 3444, '2018-01-01' UNION ALL select 600, '2018-08-30'
").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
val source1 = spark.read.load("/mypartitioneddata")
source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= (select latest_date
from latest_dates where source='source1') ").show

 ----------------------------

 

Error summary

---

java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#35 []
 at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
 at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)

-------

This reproducer works in previous versions (2.3.2, 2.3.1, etc).

 


> Scalar Subquery not handled properly in Spark 2.4 
> --------------------------------------------------
>
>                 Key: SPARK-26996
>                 URL: https://issues.apache.org/jira/browse/SPARK-26996
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Ilya Peysakhov
>            Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 1 column
(scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of ints and
dates, then read both tables as views and filter the "fact" based on the max(date) from the
first table. This is done within spark-shell in spark 2.4 vanilla (also reproduced in AWS
EMR 5.20.0)
> -------------------------
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL select '2018-01-02',
'source2' UNION ALL select '2018-01-03' , 'source3' UNION ALL select '2018-01-04' ,'source4'
").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, '2018-01-02'
UNION ALL select 300, '2018-01-03' UNION ALL select 3444, '2018-01-01' UNION ALL select 600,
'2018-08-30' ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= (select
latest_date from latest_dates where source='source1') ").show
>  ----------------------------
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#35
[]
>  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> -------
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message