spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
Date Fri, 28 Sep 2018 19:55:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632485#comment-16632485
] 

Sean Owen commented on SPARK-25150:
-----------------------------------

Hm, I am not sure I understand the example yet – help me clarify here. We have three dataframes,
really; states, humans, zombies:

 
{code:java}
State,Total Population,Total Area
RI,1200000,300000
MA,8000000,17000000
NH,3300000,9100000

+-----+-----+
|State|count|
+-----+-----+
|   RI|    2|
|   NH|    1|
+-----+-----+

+-----+-----+
|State|count|
+-----+-----+
|   RI|    1|
|   MA|    1|
+-----+-----+{code}
You join all three on state:
{code:java}
analysis = (
        states
        .join(
            total_humans,
            on=(states['State'] == total_humans['State'])
        )
        .join(
            total_zombies,
            on=(states['State'] == total_zombies['State'])
        )
        .orderBy(states['State'], ascending=True)
        .select(
            states['State'],
            states['Total Population'],
            total_humans['count'].alias('Total Humans'),
            total_zombies['count'].alias('Total Zombies'),
        )
    )
{code}
and you get
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   NH|         3300000|           1|            1|
|   NH|         3300000|           1|            1|
|   RI|         1200000|           2|            1|
|   RI|         1200000|           2|            1|
+-----+----------------+------------+-------------+{code}
But say you expect
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   RI|         1200000|           2|            1|
+-----+----------------+------------+-------------+{code}
 

First, this isn't a cross join right? the message says it thinks there is no join condition
and wonders if you're really trying to do a cross join, but you're not, so enabling it isn't
helping. If these were cross-joins, the output would be correct I think?

The second join joins on a column in {{states}}, but that is not a DataFrame used in that
join. Is that the problem?

 

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt,
persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1
and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears
to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here.
The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly
incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit
cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be left
outer join instead of an inner join (since some of the aggregates are not available for all
states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message