spark-user mailing list archives

From Kevin Peng <kpe...@gmail.com>
Subject Re: Weird results with Spark SQL Outer joins
Date Mon, 02 May 2016 21:13:43 GMT
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get sample
data out, since the dataset I am using is proprietary.

From the above set of queries (the ones mentioned in the comments above), both
the inner and the outer joins are producing the same counts.  They are
basically pulling out selected columns from the query, but there is no roll-up
happening or anything else that would suggest any difference besides the type
of join.  The tables are matched on three keys that are present in both tables
(ad, account, and date); on top of this, they are filtered by date being on or
after 2016-01-03.  Since all the joins are producing the same counts, the
natural suspicion is that the tables are identical, but that is not what I see
when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count

res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count

res15: Long = 42693


The above two queries apply the same date filter used by the joins
(2016-01-03), and you can see that the row counts of the two tables are
different.  That is why I suspect something is wrong with the outer joins in
Spark SQL: in this situation the right and full outer joins may produce the
same results, but they should not equal the left join, and definitely not the
inner join, unless I am missing something.
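
One thing that may be worth ruling out first: the inner join quoted below
filters on both s.date and d.date in the WHERE clause.  Assuming the outer join
queries use the same WHERE clause (they are not shown in this message), any row
where one side is NULL fails that filter, so every outer join would collapse to
the inner join count.  The sketch below models this in plain Scala on made-up
rows; the object name, helper names, and sample data are hypothetical and only
illustrate the semantics, not the real tables.

```scala
// Toy model of outer-join semantics; all rows here are invented for illustration.
object OuterJoinFilterDemo {
  // (ad, account, date) is the join key, as in the thread.
  final case class Row(ad: String, account: String, date: String)

  val cutoff = "2016-01-03"

  // Hypothetical data: two rows match, and each side has one unmatched row.
  val swig: Seq[Row] = Seq(
    Row("a1", "acc1", "2016-01-05"),
    Row("a2", "acc1", "2016-01-06"),
    Row("a3", "acc2", "2016-01-07") // only in swig
  )
  val dps: Seq[Row] = Seq(
    Row("a1", "acc1", "2016-01-05"),
    Row("a2", "acc1", "2016-01-06"),
    Row("a4", "acc3", "2016-01-08") // only in dps
  )

  // FULL OUTER JOIN on all three keys; None stands in for a NULL side.
  def fullOuter(s: Seq[Row], d: Seq[Row]): Seq[(Option[Row], Option[Row])] = {
    val matched   = for (l <- s; r <- d if l == r) yield (Option(l), Option(r))
    val leftOnly  = s.filterNot(l => d.contains(l)).map(l => (Option(l), Option.empty[Row]))
    val rightOnly = d.filterNot(r => s.contains(r)).map(r => (Option.empty[Row], Option(r)))
    matched ++ leftOnly ++ rightOnly
  }

  // WHERE s.date >= cutoff AND d.date >= cutoff, applied after the join.
  // A comparison against a missing (NULL) side is never true, so the row is dropped.
  def whereBothDates(rows: Seq[(Option[Row], Option[Row])]): Seq[(Option[Row], Option[Row])] =
    rows.filter { case (l, r) => l.exists(_.date >= cutoff) && r.exists(_.date >= cutoff) }

  // Inner join with the same date filter, for comparison.
  val innerCount: Int =
    (for (l <- swig; r <- dps if l == r && l.date >= cutoff) yield 1).size

  // Filter applied after the outer join: unmatched rows disappear.
  val filteredAfterJoin = whereBothDates(fullOuter(swig, dps))

  // Filter applied to each table before the outer join: unmatched rows survive.
  val filteredBeforeJoin =
    fullOuter(swig.filter(_.date >= cutoff), dps.filter(_.date >= cutoff))

  def main(args: Array[String]): Unit = {
    println(s"inner join:                 $innerCount")
    println(s"full outer, filter after:   ${filteredAfterJoin.size}")
    println(s"full outer, filter before:  ${filteredBeforeJoin.size}")
  }
}
```

If this is what is happening with the real tables, applying each date filter in
a subquery on its own table before the outer join should make the left, right,
and full outer counts diverge from the inner join count.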


Side note: In my hasty response above I posted the wrong count for
dps.count; the real value is res16: Long = 42694


Thanks,


KP



On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8964@hotmail.com> wrote:

> We are still not sure what the problem is if you cannot show us some
> example data.
>
> For dps with 42632 rows and swig with 42034 rows: dps full outer join swig
> on 3 columns, with additional filters, gets the same result set row count as
> dps left outer join swig on 3 columns with the additional filters, which
> again gets the same result set row count as dps right outer join swig on
> 3 columns with the same additional filters.
>
> Without knowing your data, I cannot see why that has to be a bug in
> Spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> ------------------------------
> From: kpeng1@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengupta@gmail.com
> CC: user@spark.apache.org
>
>
> Gourav,
>
> I wish that was the case, but I have done a select count on each of the two
> tables individually and they return different numbers of rows:
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
>
> dps.count()
> RESULT: 42632
>
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpeng1@gmail.com> wrote:
>
> Also, the inner join query produced the same result:
> sqlContext.sql("""
>   SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc,
>          s.ad AS s_ad, d.ad AS d_ad,
>          s.spend AS s_spend, d.spend_in_dollar AS d_spend
>   FROM swig_pin_promo_lt s
>   INNER JOIN dps_pin_promo_lt d
>     ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
>   WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
> """).count()
> RESULT: 23747
>
>
>
