spark-user mailing list archives

From Yanbo <yanboha...@gmail.com>
Subject Re: How to keep a local variable in each cluster?
Date Mon, 24 Nov 2014 16:17:49 GMT


Sent from my iPad

> On Nov 24, 2014, at 9:41 AM, zh8788 <78343224@qq.com> wrote:
> 
> Hi,
> 
> I am new to Spark; this is my first time posting here. I am currently
> trying to implement the ADMM optimization algorithm for Lasso/SVM,
> and I have run into the following problem:
> 
> Since the training data (label, feature) is large, I created an RDD and
> cached the training data in memory. ADMM then needs to keep local
> parameters (u, v), which differ per partition. In each iteration, I need
> to use the training data (only from that partition) together with u and v
> to calculate new values for u and v.
> 
RDDs have a transformation named mapPartitions(), which runs your function separately on each partition of the RDD.
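Since mapPartitions() hands your function an iterator over one partition's records, any state created inside that function is naturally partition-local. Here is a plain-Python sketch of that semantics (no cluster; the list of lists and the helper names are stand-ins, not the real Spark API):

```python
# Plain-Python stand-in for a 3-partition RDD.
partitions = [[1, 2, 3], [4, 5], [6]]

def map_partitions(parts, f):
    """Mimic RDD.mapPartitions: f receives an iterator over one
    partition's records and returns an iterator of results."""
    return [list(f(iter(p))) for p in parts]

def partition_sum(it):
    # This variable exists only within one partition's invocation,
    # just like per-partition (u, v) state would.
    total = 0
    for x in it:
        total += x
    yield total

print(map_partitions(partitions, partition_sum))  # [[6], [9], [6]]
```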
> Question1:
> 
> One way is to zip the training data and (u, v) into one RDD and update it
> each iteration, but as noted, the training data is large and never changes,
> while only (u, v), which are small, change each iteration. If I zip all
> three together, I cannot cache that RDD (since it changes every iteration).
> But if I don't cache it, the training data gets recomputed every
> iteration. How can I avoid that?
> 
> Question2:
> 
> Related to Question 1: the online documentation says that if we don't
> cache an RDD, it will not stay in memory. RDDs are also lazily evaluated,
> so I am confused about when a previously computed RDD is actually in
> memory.
> 
> Case1:
> 
> B = A.map(function1)
> B.collect()    # This forces B to be computed? After that, does the node
> # just release B, since it is not cached?
> D = B.map(function3)
> D.collect()
> 
> Case2:
> B = A.map(function1)
> D = B.map(function3)   
> D.collect()
> 
> Case3:
> 
> B = A.map(function1)
> C = A.map(function2)
> D = B.map(function3) 
> D.collect()
> 
> In which of these cases can I assume B is in memory on each node when I
> calculate D?
> 
If you want a certain RDD kept in memory, use RDD.persist(StorageLevel.MEMORY_ONLY).
Spark automatically monitors cache usage on each node and drops old data partitions in
a least-recently-used (LRU) fashion.
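To see why persisting matters for the three cases, here is a plain-Python analogy (not the Spark API) of recomputation versus caching: without persist(), a second action re-derives B from A; once B is materialized, function1 runs only once per element.

```python
calls = 0  # counts how many times function1 actually runs

def function1(x):
    global calls
    calls += 1
    return x * 2

A = [1, 2, 3]

# Without caching (Case 1): each action re-derives B from A.
B = [function1(x) for x in A]                   # B.collect()
D = [x + 1 for x in (function1(x) for x in A)]  # D.collect() recomputes B
assert calls == 6   # function1 ran twice per element

# With B "persisted": computed once, reused by later actions.
calls = 0
B = [function1(x) for x in A]                   # like B.persist(); B.collect()
D = [x + 1 for x in B]                          # reuses the cached B
assert calls == 3   # function1 ran once per element
```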
> Question3:
> 
> Can I use a function to perform operations on two RDDs?
Yes, but such a function can only run on the driver; you cannot use one RDD inside a transformation of another.
> 
> E.g.  Function newfun(rdd1, rdd2)
> # rdd1 is large and does not change for the whole run (training data),
> # so I can cache it
> # rdd2 is small and changes in each iteration (u, v)
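For the concrete (u, v) case, a common pattern is to keep the large RDD cached and treat the small parameters as driver-side values that each iteration's map picks up through its closure (or via sc.broadcast in real Spark). A plain-Python sketch under those assumptions, with a toy update that pulls u toward the data mean (the data and update rule are made up for illustration):

```python
# Stand-in for a cached, partitioned training RDD (never changes).
training = [[1.0, 2.0, 3.0], [4.0, 5.0]]
n = sum(len(p) for p in training)

u = 0.0                        # small driver-side parameter
for _ in range(5):
    # "map" each partition with the current u captured in the closure,
    # then "reduce" the per-partition partial sums back on the driver.
    partials = [sum(x - u for x in p) for p in training]
    u = u + sum(partials) / n  # driver updates u for the next iteration

print(u)  # converges to the data mean, 3.0
```

Only the small, changing values cross the driver/executor boundary each iteration; the large cached dataset stays put.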
> 
> 
> Questions4:
> 
> Or are there other ways to solve this kind of problem? I think this is a
> common problem, but I could not find any good solutions.
> 
> 
> Thanks a lot
> 
> Han 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 
