spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Shtelma <mshte...@gmail.com>
Subject Re: Inner join with the table itself
Date Mon, 15 Jan 2018 14:26:47 GMT
Hi Jacek & Gengliang,

let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
").collect()

This query is working for me right now using spark 2.2.

Now we can try implementing the same logic with DataFrame API:

pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()

I am getting the following error:

"Join condition is missing or trivial.

Use the CROSS JOIN syntax to allow cartesian products between these relations.;"

I have tried using alias function, but without success:

val pos2 = pos.alias("P2")
pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()

This also leads us to the same error.
Am  I missing smth about the usage of alias?

Now let's rename the columns:

val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()

It works!

There is one more really odd thing about all this: a colleague of mine
has managed to get the same exception ("Join condition is missing or
trivial") also using original SQL query, but I think he has been using
empty tables.

Thanks,
Michael


On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
<gengliang.wang@databricks.com> wrote:
> Hi Michael,
>
> You can use `Explain` to see how your query is optimized.
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
> I believe your query is an actual cross join, which is usually very slow in
> execution.
>
> To get rid of this, you can set `spark.sql.crossJoin.enabled` as true.
>
>
> 在 2018年1月15日,下午6:09,Jacek Laskowski <jacek@japila.pl> 写道:
>
> Hi Michael,
>
> -dev +user
>
> What's the query? How do you "fool spark"?
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <mshtelma@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> If I try joining the table with itself using join columns, I am
>> getting the following error:
>> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
>> allow cartesian products between these relations.;"
>>
>> This is not true, and my join is not trivial and is not a real cross
>> join. I am providing join condition and expect to get maybe a couple
>> of joined rows for each row in the original table.
>>
>> There is a workaround for this, which implies renaming all the columns
>> in source data frame and only afterwards proceed with the join. This
>> allows us to fool spark.
>>
>> Now I am wondering if there is a way to get rid of this problem in a
>> better way? I do not like the idea of renaming the columns because
>> this makes it really difficult to keep track of the names in the
>> columns in result data frames.
>> Is it possible to deactivate this check?
>>
>> Thanks,
>> Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message