spark-issues mailing list archives

From "Ajay Monga (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
Date Fri, 04 May 2018 07:32:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463494#comment-16463494 ]

Ajay Monga commented on SPARK-24177:
------------------------------------

The suspicion has been strengthened by the fact that when the query is rewritten so that
the date-shift logic is applied in the SELECT clause and the join is done afterwards, the
result is correct and consistent across runs.
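
For reference, a minimal sketch of that rewritten form, assuming the table and column names
from the report below; the date-shift expression is kept as a placeholder, and the alias
shifted_date_value is hypothetical, not from the original query:

    SELECT
      second_table.date_value, SUM(intermediate.XXX * second_table.shift_value)
    FROM
    (
      SELECT
        <date shift logic applied to date_value> AS shifted_date_value,
        SUM(value) AS XXX
      FROM first_table
      WHERE date IN ( '2018-01-01', '2018-01-02' )
      GROUP BY <date shift logic applied to date_value>
    ) intermediate
    LEFT OUTER JOIN second_table
      ON second_table.date_value = intermediate.shifted_date_value
      AND second_table.date_value IN ( '2018-01-02', '2018-01-03' )
    GROUP BY second_table.date_value

With the shift computed before the join, the ON clause becomes a plain equi-join on values
that already exist on both sides, so the match no longer depends on evaluating the shift
expression inside the join condition.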

> Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24177
>                 URL: https://issues.apache.org/jira/browse/SPARK-24177
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Production
>            Reporter: Ajay Monga
>            Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. Across runs it returns
different rows, and the column on which a simple multiplication takes place returns
different values.
> The query is like:
> SELECT
>   second_table.date_value, SUM(intermediate.XXX * second_table.shift_value)
> FROM
> (
>   SELECT
>     date_value, SUM(value) AS XXX
>   FROM first_table
>   WHERE date IN ( '2018-01-01', '2018-01-02' )
>   GROUP BY date_value
> ) intermediate
> LEFT OUTER JOIN second_table
>   ON second_table.date_value = (<Logic to change the 'date_value' from first table, say if it's a Saturday or Sunday then use Monday, else next valid working date>)
>   AND second_table.date_value IN ( '2018-01-02', '2018-01-03' )
> GROUP BY second_table.date_value
>  
> The suspicion is that the execution of the above query is split into two queries, one for
first_table and one for second_table, before joining. The result then gets split across
partitions, seemingly grouped/distributed by the join column 'date_value'. The join contains
date-shift logic that fails to match rows in some cases when it should, primarily for the
date_values at the edges of the partitions across the Spark cluster. So the outcome depends
on how the data (or the RDD) of the individual queries is partitioned in the first place.
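
For illustration only, a hedged sketch of the kind of date-shift expression the placeholder
above describes (Saturday or Sunday rolled forward to Monday). This assumes date_format and
date_add are available (both exist in Spark SQL 1.5+); the exact logic in the production
query may differ:

    CASE date_format(date_value, 'u')
      WHEN '6' THEN date_add(date_value, 2)  -- Saturday -> following Monday
      WHEN '7' THEN date_add(date_value, 1)  -- Sunday   -> following Monday
      ELSE date_value                        -- assume weekdays are valid working dates
    END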
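
One way to test the partitioning suspicion is to inspect the physical plan and look at the
Exchange (shuffle) operators, which show which expressions each side is hash-partitioned on
before the join. A minimal sketch, assuming the plain SQL EXPLAIN syntax Spark SQL supports:

    EXPLAIN EXTENDED
    SELECT second_table.date_value, SUM(XXX * second_table.shift_value)
    FROM ...  -- the full query above

If the join condition wraps one side's key in the date-shift expression, the plan should
show whether the shuffle is partitioned on the raw column or on the shifted expression.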



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

