spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From CC GP <chandrika.gopalakris...@gmail.com>
Subject Re: Force inner join to shuffle the smallest table
Date Wed, 24 Jun 2015 19:07:53 GMT
Try below and see if it makes a difference:

val result = sqlContext.sql(“select big.f1, big.f2 from small inner join
big on big.s=small.s and big.d=small.d”)

On Wed, Jun 24, 2015 at 11:35 AM, Ulanov, Alexander <alexander.ulanov@hp.com
> wrote:

>  Hi,
>
>
>
> I try to inner join of two tables on two fields(string and double). One
> table is 2B rows, the second is 500K. They are stored in HDFS in Parquet.
> Spark v 1.4.
>
> val big = sqlContext.paquetFile(“hdfs://big”)
>
> data.registerTempTable(“big”)
>
> val small = sqlContext.paquetFile(“hdfs://small”)
>
> data.registerTempTable(“small”)
>
> val result = sqlContext.sql(“select big.f1, big.f2 from big inner join
> small on big.s=small.s and big.d=small.d”)
>
>
>
> This query fails in the middle due to one of the workers “disk out of
> space” with shuffle reported 1.8TB which is the maximum size of my spark
> working dirs (on total 7 worker nodes). This is surprising, because the
> “big” table takes 2TB disk space (unreplicated) and “small” about 5GB and I
> would expect that optimizer will shuffle the small table. How to force
> Spark to shuffle the small table? I tried to write “small inner join big”
> however it also fails with 1.8TB of shuffle.
>
>
>
> Best regards, Alexander
>
>
>

Mime
View raw message