spark-user mailing list archives

From zh8788 <78343...@qq.com>
Subject How to keep a local variable in each cluster?
Date Mon, 24 Nov 2014 01:41:37 GMT
Hi,

I am new to Spark; this is my first time posting here. I am currently trying
to implement the ADMM optimization algorithm for Lasso/SVM, and I have run
into a problem:

Since the training data (label, feature) is large, I created an RDD and
cached the training data in memory. ADMM also needs to keep local parameters
(u, v), which are different for each partition. In each iteration I need to
use the training data (only the data on that partition) together with u and
v to calculate new values for u and v.

Question 1:

One way is to zip (training data, u, v) into one RDD and update it in each
iteration. But as noted above, the training data is large and never changes,
while only u and v (which are small) change in each iteration. If I zip the
three together, I cannot cache that RDD, since it changes every iteration.
And if I do not cache it, the training data gets recomputed every iteration.
How can I avoid that?

Question 2:

Related to Question 1: the online documentation says that if we do not cache
an RDD, it will not stay in memory, and RDDs are evaluated lazily. So I am
confused about when a previously computed RDD is actually in memory.

Case 1:

B = A.map(function1)
B.collect()    # This forces B to be computed? After that, does the node
               # release B, since it is not cached?
D = B.map(function3)
D.collect()

Case 2:

B = A.map(function1)
D = B.map(function3)
D.collect()

Case 3:

B = A.map(function1)
C = A.map(function2)
D = B.map(function3)
D.collect()
 
In which of these cases is B in memory on each worker when I calculate D?

Question 3:

Can I write a function that operates on two RDDs?

E.g.   newfun(rdd1, rdd2)
# rdd1 is large and does not change for the whole run (training data),
# so I can cache it
# rdd2 is small and changes in each iteration (u, v)


Question 4:

Or is there another way to solve this kind of problem? I think it is a
common problem, but I have not found any good solutions.


Thanks a lot

Han 

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

