spark-user mailing list archives

From: Ashic Mahtab <as...@live.com>
Subject: RE: Spark join and large temp files
Date: Mon, 08 Aug 2016 18:53:52 GMT
Hi Deepak,

Thanks for the response. Registering the temp tables didn't help. Here's what I have:
val a = sqlContext.read.parquet(...)
  .select("eid.id", "name")
  .withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id = y.id")
results.write.parquet(...)
Is there something I'm missing?
Cheers,
Ashic.
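
Since b is only ~1.3GB while a is 1.5TB, one option worth noting here is a broadcast join, which avoids shuffling the large side at all. A minimal sketch, assuming b fits comfortably in driver and executor memory, reusing the dataframes above:

import org.apache.spark.sql.functions.broadcast

// Ship the small side (b) to every executor; the join then runs map-side
// against the broadcast copy, so the 1.5TB side is never shuffled to disk.
val results = a.join(broadcast(b), Seq("id"), "right_outer")
results.write.parquet(...)

At ~1.3GB, b is near the practical upper limit for a broadcast, so this assumes the executors have the memory headroom for it.
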
From: deepakmca05@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: ashic@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp tables. This should resolve your issue.

Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <ashic@live.com> wrote:

Hello,

We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int   (1.3GB)
We need to join these two to get (id, Number, Name). We've tried two approaches:

a.join(b, Seq("id"), "right_outer")

where a and b are dataframes. We also tried taking the RDDs, mapping them to pair RDDs with id as the key, and then joining (sketched below). What we're seeing is that temp file usage increases during the join stage and fills up our disks, causing the job to crash. Is there a way to join these two datasets without, well... crashing?
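
For reference, a minimal sketch of that pair-RDD variant (assuming a and b are the dataframes above, with id as the first column; the getString/getInt positions follow the schemas given):

val aPairs = a.rdd.map(r => (r.getString(0), r.getString(1)))  // (id, name)
val bPairs = b.rdd.map(r => (r.getString(0), r.getInt(1)))     // (id, number)

// join() shuffles both sides by key, so the 1.5TB side still spills to the
// local temp directories during the shuffle, matching the symptom described.
val joined = aPairs.join(bPairs)  // (id, (name, number))
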
Note: the ids are unique, and there's a one-to-one mapping between the two datasets.
Any help would be appreciated.
-Ashic. 

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net