spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gengliang Wang <gengliang.w...@databricks.com>
Subject Re: Inner join with the table itself
Date Mon, 15 Jan 2018 10:27:23 GMT
Hi Michael,

You can use `Explain` to see how your query is optimized. 
https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html <https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html>
I believe your query is an actual cross join, which is usually very slow in execution.

To get rid of this, you can set `spark.sql.crossJoin.enabled` as true.


> 在 2018年1月15日,下午6:09,Jacek Laskowski <jacek@japila.pl> 写道:
> 
> Hi Michael,
> 
> -dev +user
> 
> What's the query? How do you "fool spark"?
> 
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski <https://about.me/JacekLaskowski>
> Mastering Spark SQL https://bit.ly/mastering-spark-sql <https://bit.ly/mastering-spark-sql>
> Spark Structured Streaming https://bit.ly/spark-structured-streaming <https://bit.ly/spark-structured-streaming>
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams <https://bit.ly/mastering-kafka-streams>
> Follow me at https://twitter.com/jaceklaskowski
>  <https://twitter.com/jaceklaskowski>
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <mshtelma@gmail.com <mailto:mshtelma@gmail.com>>
wrote:
> Hi all,
> 
> If I try joining the table with itself using join columns, I am
> getting the following error:
> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> allow cartesian products between these relations.;"
> 
> This is not true, and my join is not trivial and is not a real cross
> join. I am providing join condition and expect to get maybe a couple
> of joined rows for each row in the original table.
> 
> There is a workaround for this, which implies renaming all the columns
> in source data frame and only afterwards proceed with the join. This
> allows us to fool spark.
> 
> Now I am wondering if there is a way to get rid of this problem in a
> better way? I do not like the idea of renaming the columns because
> this makes it really difficult to keep track of the names in the
> columns in result data frames.
> Is it possible to deactivate this check?
> 
> Thanks,
> Michael
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <mailto:dev-unsubscribe@spark.apache.org>
> 
> 


Mime
View raw message