spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject RE: Force inner join to shuffle the smallest table
Date Wed, 24 Jun 2015 19:27:15 GMT
It also fails, as I mentioned in the original question.

From: CC GP [mailto:chandrika.gopalakrishna@gmail.com]
Sent: Wednesday, June 24, 2015 12:08 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Force inner join to shuffle the smallest table

Try below and see if it makes a difference:

val result = sqlContext.sql("select big.f1, big.f2 from small inner join big on big.s=small.s and big.d=small.d")

On Wed, Jun 24, 2015 at 11:35 AM, Ulanov, Alexander <alexander.ulanov@hp.com> wrote:
Hi,

I am trying to inner join two tables on two fields (a string and a double). One table has 2B rows, the other 500K. Both are stored in HDFS as Parquet. Spark v1.4.
val big = sqlContext.parquetFile("hdfs://big")
big.registerTempTable("big")
val small = sqlContext.parquetFile("hdfs://small")
small.registerTempTable("small")
val result = sqlContext.sql("select big.f1, big.f2 from big inner join small on big.s=small.s and big.d=small.d")

This query fails partway through because one of the workers runs out of disk space, with a reported shuffle size of 1.8TB, which is the total capacity of my Spark working dirs across all 7 worker nodes. This is surprising: the "big" table takes 2TB of disk space (unreplicated) and "small" about 5GB, so I would expect the optimizer to shuffle only the small table. How can I force Spark to shuffle the small table? I tried writing "small inner join big", but it also fails with 1.8TB of shuffle.

Best regards, Alexander
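[Editorial note for readers of this archived thread: reordering the tables in the SQL text does not change which side the planner shuffles; in a shuffled join both tables are repartitioned by join key, which is what produces the large shuffle here. The usual way to avoid shuffling the big table in this era of Spark is a broadcast join of the small side. The sketch below is not from the thread; it assumes a Spark 1.4 `SQLContext` named `sqlContext` with the two tables already registered, and the threshold value is an illustrative assumption.]

```scala
// Sketch: spark.sql.autoBroadcastJoinThreshold is in bytes; tables whose
// estimated size is below it are broadcast to every executor instead of
// shuffled. The default is only ~10MB, so a ~5GB table is never broadcast
// automatically. Raising the threshold above the small table's size is an
// assumption-laden workaround, not a guaranteed fix.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (6L * 1024 * 1024 * 1024).toString)  // ~6GB, above the small table's size

// Re-run the join and inspect the physical plan: a BroadcastHashJoin node
// means only the small table is sent over the network.
val result = sqlContext.sql(
  "select big.f1, big.f2 from big inner join small " +
  "on big.s = small.s and big.d = small.d")
result.explain()
```

Broadcasting a multi-GB table requires enough driver and executor memory to hold it; if that is not feasible, the fallback is a shuffle-based join, in which both sides are repartitioned by key, which is consistent with the 1.8TB shuffle observed above.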

