spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-12586) Wrong answer with registerTempTable and union sql query
Date Tue, 10 Jan 2017 06:37:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-12586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-12586.
----------------------------------
    Resolution: Not A Problem

I just ran the code you attached and it prints the following:

{code}
+-------+
|v_value|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+

+---+---+---+---+-----+
|row|col|foo|bar|value|
+---+---+---+---+-----+
|  3|  1|  1|  1| null|
|  2|  1|  1|  1|    3|
|  3|  2|  1|  1| null|
|  3|  3|  1|  1|    2|
|  3|  4|  1|  2| null|
+---+---+---+---+-----+

Traceback (most recent call last):
  File "/Users/hyukjinkwon/Desktop/workspace/repos/forked/spark/sql_bug.py", line 52, in <module>
    result.show()
  File "./python/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/Users/hyukjinkwon/Desktop/workspace/repos/forked/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "./python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nProject [v1#36L]\n+- Filter isnull(v2#37L)\n   +- Join LeftOuter, (v1#36L = v2#37L)\n      :- Project [v_value#0L AS v1#36L]\n      :  +- LogicalRDD [v_value#0L]\n      +- Aggregate [row#7L, col#8L, foo#9L, bar#10L, value#28L], [value#28L AS v2#37L]\n         +- Union\n            :- Project [row#7L, col#8L, foo#9L, bar#10L, 8 AS value#28L]\n            :  +- Filter (((isnotnull(row#7L) && isnotnull(col#8L)) && ((row#7L = 1) && (col#8L = 2))) && (((isnotnull(bar#10L) && isnotnull(foo#9L)) && (foo#9L = 1)) && (bar#10L = 2)))\n            :     +- LogicalRDD [row#7L, col#8L, foo#9L, bar#10L, value#11L]\n            +- Filter ((NOT (row#7L = 1) || NOT (col#8L = 2)) && ((((isnotnull(bar#10L) && isnotnull(foo#9L)) && (foo#9L = 1)) && (bar#10L = 2)) && isnotnull(value#11L)))\n               +- LogicalRDD [row#7L, col#8L, foo#9L, bar#10L, value#11L]\nand\nAggregate [row#7L, col#8L, foo#9L, bar#10L, value#28L], [row#7L, col#8L]\n+- Union\n   :- Project [row#7L, col#8L, foo#9L, bar#10L, 8 AS value#28L]\n   :  +- LocalRelation <empty>, [row#7L, col#8L, foo#9L, bar#10L, value#11L]\n   +- Filter ((NOT (row#7L = 1) || NOT (col#8L = 2)) && ((((isnotnull(bar#10L) && isnotnull(foo#9L)) && (foo#9L = 1)) && (bar#10L = 2)) && isnull(value#11L)))\n      +- LogicalRDD [row#7L, col#8L, foo#9L, bar#10L, value#11L]\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
{code}

The behaviour has since changed and this JIRA appears to be obsolete, so I am resolving it as {{Not
A Problem}}. Please reopen it if anyone feels this action is inappropriate or I was wrong.
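For reference, the error above is triggered by the final join between {{sa}} and {{sb}}, which has no join condition. A sketch of the two usual remedies (alias and table names are taken from the attached script; this fragment is untested here):

{code}
-- Either state the cartesian product explicitly with CROSS JOIN ...
...) sa cross join
    (select row, col from t1 where foo=1 and bar=2 and value is null) sb

-- ... or re-enable implicit cartesian products for the session:
SET spark.sql.crossJoin.enabled=true
{code}

The explicit {{CROSS JOIN}} form is generally preferable, since it documents in the query itself that the cartesian product is intentional.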


> Wrong answer with registerTempTable and union sql query
> -------------------------------------------------------
>
>                 Key: SPARK-12586
>                 URL: https://issues.apache.org/jira/browse/SPARK-12586
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>         Environment: Windows 7
>            Reporter: shao lo
>         Attachments: sql_bug.py
>
>
> The following sequence of sql() and registerTempTable() calls gets the wrong answer.
> The correct answer is returned if the temp table is rewritten.
> {code}
>     sql_text = """select row, col, foo, bar, value2 value
>                 from (select row, col, foo, bar, 8 value2 from t0 where row=1 and col=2) s1
>                       union select row, col, foo, bar, value from t0 where not (row=1 and col=2)"""
>     df2 = sqlContext.sql(sql_text)
>     df2.registerTempTable("t1")
>     # # The following two-line workaround somehow fixes the problem:
>     # df3 = sqlContext.createDataFrame(df2.collect())
>     # df3.registerTempTable("t1")
>     # # The following four-line workaround also fixes the problem, but takes far longer:
>     # filename = "t1.json"
>     # df2.write.json(filename, mode='overwrite')
>     # df3 = sqlContext.read.json(filename)
>     # df3.registerTempTable("t1")
>     sql_text2 = """select row, col, v1 value from
>             (select v1 from
>                 (select v_value v1 from values) s1
>                   left join
>                 (select value v2,foo,bar,row,col from t1
>                   where foo=1
>                     and bar=2 and value is not null) s2
>                   on v1=v2 where v2 is null
>             ) sa join
>             (select row, col from t1 where foo=1
>                     and bar=2 and value is null) sb"""
>     result = sqlContext.sql(sql_text2)
>     result.show()
>     
>     # Expected result
>     # +---+---+-----+
>     # |row|col|value|
>     # +---+---+-----+
>     # |  3|  4|    1|
>     # |  3|  4|    2|
>     # |  3|  4|    3|
>     # |  3|  4|    4|
>     # +---+---+-----+    
>     # Actual (wrong) result when not using the workarounds above:
>     # +---+---+-----+
>     # |row|col|value|
>     # +---+---+-----+
>     # +---+---+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

