spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From gen tang <gen.tan...@gmail.com>
Subject Re: Questions about SparkSQL join on not equality conditions
Date Mon, 10 Aug 2015 15:29:47 GMT
Hi,

I am sorry to bother again.
When I do join as follow:
df = sqlContext.sql("selet a.someItem, b.someItem from a full outer join b
on condition1 *or* condition2")
df.first()

The program failed at the result size is bigger than
spark.driver.maxResultSize.
It is really strange, as one record is no way bigger than 1G.
When I do join on just one condition or equity condition, there will be no
problem.

Could anyone help me, please?

Thanks a lot in advance.

Cheers
Gen


On Sun, Aug 9, 2015 at 9:08 PM, gen tang <gen.tang86@gmail.com> wrote:

> Hi,
>
> I might have a stupid question about sparksql's implementation of join on
> not equality conditions, for instance condition1 or condition2.
>
> In fact, Hive doesn't support such join, as it is very difficult to
> express such conditions as a map/reduce job. However, sparksql supports
> such operation. So I would like to know how spark implement it.
>
> As I observe such join runs very slow, I guess that spark implement it by
> doing filter on the top of cartesian product. Is it true?
>
> Thanks in advance for your help.
>
> Cheers
> Gen
>
>
>

Mime
View raw message