spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcelo Valle <marcelo.va...@ktech.com>
Subject join with just 1 record causes all data to go to a single node
Date Thu, 21 Nov 2019 15:51:14 GMT
Hi,

I am using spark on EMR 5.28.0.

We were having a problem in production where, after a join between 2
dataframes, in some situations all data was being moved to a single node,
and then the cluster was failing after retrying many times.

Our join is something like that:

```

df1.join(df2,
  df1("field1") <=> df2("field1")
    && df1("field2") <=> df2("field2"))

```

After some harvesting, we were able to isolate the corner case - this was
only happening when all join fields were NULL. Notice the `<=>` operator
instead of `===`.

Would someone be able to explain this behavior? It looks like a bug to me,
but I could be missing something.

Thanks,
Marcelo.

This email is confidential [and may be protected by legal privilege]. If you are not the intended
recipient, please do not copy or disclose its content but contact the sender immediately upon
receipt.

KTech Services Ltd is registered in England as company number 10704940.

Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United Kingdom

Mime
View raw message