spark-user mailing list archives

From "jian.t" <jian.tan...@gmail.com>
Subject Memory requirement of using Spark
Date Fri, 24 Oct 2014 03:17:12 GMT
Hello, 
I am new to Spark and have a basic question about its memory requirements.

I need to join multiple data sets, and the join is not a straightforward one.
The logic is roughly: first join T1 with T2 on column A; then, for all the
records that found no match in that join, join T1 with T2 on column B; then on
column C, and so on (a sketch of what I mean follows below). I was using Hive,
but it requires multiple scans of T1, which turns out to be slow.
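
Roughly, the logic I have in mind looks like the sketch below, using Spark's
RDD API. Everything here is a placeholder for my real schema: the Rec case
class, the column names a/b/c, and the toy rows (in reality T1 and T2 would be
loaded from HDFS/Hive, not parallelized in the driver):

import org.apache.spark.{SparkConf, SparkContext}

object CascadingJoin {
  case class Rec(a: String, b: String, c: String, payload: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cascading-join").setMaster("local[*]"))

    // Toy stand-ins for T1 and T2.
    val t1 = sc.parallelize(Seq(
      Rec("a1", "b1", "c1", "t1-row1"),
      Rec("a9", "b2", "c2", "t1-row2"),   // no match on A, matches on B
      Rec("a9", "b9", "c3", "t1-row3")))  // only matches on C

    val t2 = sc.parallelize(Seq(
      Rec("a1", "bX", "cX", "t2-row1"),
      Rec("aX", "b2", "cY", "t2-row2"),
      Rec("aY", "bY", "c3", "t2-row3")))

    // Pass 1: left outer join on column A, then split matches from leftovers.
    val p1 = t1.keyBy(_.a).leftOuterJoin(t2.keyBy(_.a)).values
    val m1 = p1.collect { case (l, Some(r)) => (l, r) }
    val u1 = p1.collect { case (l, None) => l }

    // Pass 2: only the pass-1 leftovers are re-keyed and joined on column B.
    val p2 = u1.keyBy(_.b).leftOuterJoin(t2.keyBy(_.b)).values
    val m2 = p2.collect { case (l, Some(r)) => (l, r) }
    val u2 = p2.collect { case (l, None) => l }

    // Pass 3: whatever is still unmatched joins on column C, and so on.
    val m3 = u2.keyBy(_.c).join(t2.keyBy(_.c)).values

    (m1 union m2 union m3).collect().foreach(println)
    sc.stop()
  }
}

The point is that each pass only re-keys the shrinking set of leftovers, rather
than rescanning all of T1 the way my Hive queries do.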

It seems that if I load T1 and T2 into memory using Spark, I could improve
performance. However, T1 and T2 together are around 800 GB. Does that mean I
need 800 GB of memory (I don't have that much)? Or can Spark do something like
streaming, but then again, will performance suffer as a result?
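
For example, would persisting with MEMORY_AND_DISK let the partitions that
don't fit in RAM spill to local disk? A minimal sketch of what I have in mind,
reusing the t1 RDD from the example above (this is my guess at the right
approach, not something I have benchmarked):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps the partitions that fit in RAM and spills the
// rest to local disk, so the cluster would not need 800 GB of memory.
val t1Cached = t1.persist(StorageLevel.MEMORY_AND_DISK)
// Reuse t1Cached in every join pass; spilled partitions are re-read from
// disk instead of being recomputed from the original source.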



Thanks
JT


