You can use Spark SQL for this use case, and you don't need 800G of memory (unless, of course, you cache the whole dataset in memory). If you don't want to bring the whole dataset into memory, persist the data with the MEMORY_AND_DISK_SER storage level; most of the data will then reside on disk, and Spark will make use of it efficiently.
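A minimal sketch of that persistence call, assuming an existing SparkContext `sc` and a hypothetical HDFS path:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path; replace with your actual data source.
val t1 = sc.textFile("hdfs:///data/t1")

// Serialize records in memory and spill whatever doesn't fit to disk,
// instead of requiring the whole dataset to be resident in RAM.
t1.persist(StorageLevel.MEMORY_AND_DISK_SER)
```

With this storage level the serialized partitions that exceed available memory are written to local disk and re-read on demand, so the cluster's total memory can be much smaller than the dataset.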

Best Regards

On Fri, Oct 24, 2014 at 8:47 AM, jian.t <> wrote:
I am new to Spark. I have a basic question about the memory requirements of
using Spark.

I need to join multiple data sets from multiple data sources. The join is
not a straightforward join. The logic is more like: first join T1 with T2 on
column A, then for all the records that couldn't find a match in that join,
join T1 with T2 on column B, then on column C, and so on. I was using Hive,
but it requires multiple scans of T1, which turns out to be slow.
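The cascading fallback join described above can be sketched with Spark's RDD API. This is an illustrative sketch, not the poster's actual code: the row type `T1Row`, its columns, and the shape of T2 as a keyed pair RDD are all assumptions for the example.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical row type: three join-key columns plus a payload.
case class T1Row(a: String, b: String, c: String, payload: String)

// t2 is assumed to be keyed by its join column: (key, value).
def cascadeJoin(t1: RDD[T1Row], t2: RDD[(String, String)]): RDD[(T1Row, String)] = {
  // First pass: join on column A.
  val byA      = t1.keyBy(_.a).leftOuterJoin(t2)
  val matchedA = byA.collect { case (_, (row, Some(v))) => (row, v) }
  val missA    = byA.collect { case (_, (row, None))    => row }

  // Second pass: only the unmatched rows, joined on column B.
  val byB      = missA.keyBy(_.b).leftOuterJoin(t2)
  val matchedB = byB.collect { case (_, (row, Some(v))) => (row, v) }
  val missB    = byB.collect { case (_, (row, None))    => row }

  // Final pass: remaining rows joined on column C.
  val matchedC = missB.keyBy(_.c).join(t2).values

  matchedA union matchedB union matchedC
}
```

Because each pass only re-keys the leftover (unmatched) rows, T1 is scanned once per level of the cascade over a shrinking subset, rather than Hive rescanning the full table; persisting `t1` and `t2` before calling this avoids recomputing them across passes.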

It seems that if I load T1 and T2 into memory using Spark, I could improve
performance. However, T1 and T2 together total around 800G. Does that mean I
need 800G of memory (I don't have that amount of memory)? Or could Spark do
something like streaming, and if so, will performance suffer as a result?


Sent from the Apache Spark User List mailing list archive.