spark-user mailing list archives

From "jian.t" <>
Subject Memory requirement of using Spark
Date Fri, 24 Oct 2014 03:17:12 GMT
I am new to Spark. I have a basic question about the memory requirement of
using Spark. 

I need to join multiple data sets, and the join is not a straightforward one. The logic is
more like: first join T1 with T2 on column A; then, for all the records that
couldn't find a match in that join, join T1 with T2 on column B; then on column
C, and so on. I was using Hive, but it requires multiple scans of T1, which
turns out to be slow.
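To make the matching rule concrete, here is a minimal sketch of that cascading
join in plain Python (not Spark); the column names A, B, C and the record
layout are illustrative:

```python
def cascading_join(t1, t2, keys=("A", "B", "C")):
    """Join t1 to t2 on the first key that finds a match; records left
    unmatched fall through and are retried on the next key."""
    unmatched = list(t1)
    results = []
    for key in keys:
        # Index the right-hand table by the current join key.
        index = {}
        for r2 in t2:
            index.setdefault(r2[key], []).append(r2)
        still_unmatched = []
        for r1 in unmatched:
            matches = index.get(r1[key])
            if matches:
                for r2 in matches:
                    results.append((r1, r2))
            else:
                still_unmatched.append(r1)
        unmatched = still_unmatched  # only these records try the next key
    return results, unmatched

t1 = [{"A": 1, "B": 9, "C": 0}, {"A": 5, "B": 2, "C": 0}]
t2 = [{"A": 1, "B": 8, "C": 7}, {"A": 4, "B": 2, "C": 6}]
pairs, leftover = cascading_join(t1, t2)
# The first T1 record matches on A; the second misses on A but matches on B.
```

In Spark terms each pass would be one join on the current key plus a filter
that keeps only the unmatched T1 records for the next pass; if T1 and T2 are
cached, each pass reads from memory instead of rescanning the source, which is
what makes this cheaper than the repeated Hive scans.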

It seems that if I load T1 and T2 into memory using Spark, I could improve the
performance. However, T1 and T2 together are around 800 GB. Does that mean I
need 800 GB of memory (I don't have that amount of memory)? Or can Spark do
something like streaming, but then again, will performance suffer as a result?

