spark-user mailing list archives

From Ron Ayoub <>
Subject RE: Modifying an RDD in forEach
Date Sat, 06 Dec 2014 21:42:45 GMT
These are very interesting comments. The vast majority of cases I'm working on will be in the 3 million document range; 100 million was thrown out as something to shoot for, and I upped it to 500 million. All things considered, I believe I may be able to translate what I have directly to the Java Streams API and run 100 million docs on 32 cores in an hour or two, which would suit our needs. Up until this point I've been focused on the computational aspect.

If I can scale up to clustering 100 million documents on a single machine, I can probably translate what I have directly to the Java Streams API and be faster; it is scaling out that changes things. I think in this hierarchical k-means case the lazy evaluation becomes almost useless and perhaps even an impediment. Part of the problem is that I've been a bit too focused on math/information retrieval and need to brush up on the functional approach to programming so I can better utilize the tools. But it does appear that Spark may not be the best option for this need: I don't need resiliency or fault tolerance so much as I need to execute an algorithm on a large amount of data fast and then be done with it. I'm now thinking that in the 100 million document range I may be OK clustering feature vectors of no more than 25 features per doc on a single machine with 32 cores and a load of memory, translating what I have to the Java 8 Streams API.
There are also questions of proportion. Perhaps what I have is not big enough to warrant or require scaling out. I may have other uses for Spark in traditional map-reduce algorithms, such as counting pairs of shared shingles for near-dupe detection, but to this point I've found Oracle's parallel pipelined table functions, while not glamorous, are doing quite well in the DB.

I'm still a bit confused about why Spark is advertised as ideal for iterative algorithms, when iterative algorithms have that point per iteration where things do get evaluated and laziness is not terribly useful. Ideal for massive in-memory cluster computing, yes; but iterative? I'm not sure. I have the book "Functional Programming in Scala" and I hope to read it someday and enrich my understanding here.

Subject: Re: Modifying an RDD in forEach
Date: Sat, 6 Dec 2014 13:13:50 -0800

Ron,

"Appears to be working" might be true when there are no failures. On large datasets being processed on a large number of machines, failures of several types (server, network, disk, etc.) can happen. At that point, Spark will not "know" that you changed the RDD in place, and it will use any version of any partition of the RDD for the retry. Retries require idempotency, and that is difficult without immutability. I believe this is one of the primary reasons for making RDDs immutable in Spark (mutability isn't even an option worth considering). In general, mutating something in a distributed system is a hard problem. It can be solved (e.g. in NoSQL or NewSQL databases), but Spark is not a transactional data store.

If you are building an iterative machine learning algorithm, which usually has a "reduce" step at the end of every iteration, then the lazy evaluation is unlikely to be useful. On the other hand, if these intermediate RDDs stay in the young generation of the JVM heap [I am not sure whether RDD cache management somehow changes this, so I could be wrong], they are garbage collected quickly and with very little overhead.

This is the price of scaling out :-)

Hope this helps,
Mohit.
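The retry argument above can be illustrated without Spark at all. If each iteration is a pure function that builds a fresh result from its input, replaying the step after a failure yields the same answer; an in-place update that gets replayed does not. A hedged pure-Java sketch of the principle (names `pureStep`/`mutatingStep` are hypothetical, and the "+1" stands in for a real assignment update):

```java
import java.util.Arrays;

public class Idempotency {
    // Pure step: builds a new array from the old one. Replaying it on the
    // same input always yields the same output, so a retry is harmless.
    static int[] pureStep(int[] assignments) {
        int[] next = assignments.clone();
        for (int i = 0; i < next.length; i++) {
            next[i] = next[i] + 1; // stand-in for a real per-item update
        }
        return next;
    }

    // In-place step: mutates its input. If a failure causes the step to be
    // replayed on already-mutated data, the state drifts.
    static void mutatingStep(int[] assignments) {
        for (int i = 0; i < assignments.length; i++) {
            assignments[i] = assignments[i] + 1;
        }
    }

    public static void main(String[] args) {
        int[] start = {0, 0};
        System.out.println(Arrays.toString(pureStep(start))); // retry-safe
        int[] m = {0, 0};
        mutatingStep(m);
        mutatingStep(m); // a "replay" after the first run already happened
        System.out.println(Arrays.toString(m)); // one logical step, applied twice
    }
}
```

This is exactly why Spark can re-run any lost partition from its lineage: every transformation is the pure-step shape, never the mutating one.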
On Dec 6, 2014, at 5:02 AM, Mayur Rustagi <> wrote:

You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how it optimizes the execution of iterative jobs.

The simple answer is:

1. Spark doesn't materialize an RDD when you write an iteration; it lazily captures the transformation functions in the RDD (only the function and closure are recorded; no data operation actually happens).

2. When you finally execute and want to cause effects (save to disk, collect on the master, etc.), it views the DAG of execution and optimizes what it can reason about (eliminating intermediate states, performing multiple transformations in one task, and leveraging partitioning where available, among others).

Bottom line: it doesn't matter how many RDDs you have in your DAG chain, as long as Spark can optimize the functions in that DAG to create minimal materialization on its way to the final output.
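The lazy-capture behavior described in point 1 can be seen in miniature with Java Streams, which work analogously: intermediate operations only record the function, and nothing runs until a terminal operation forces the pipeline. This is only an analogy (a single-machine stream, not an RDD), with a hypothetical `calls` counter to make the laziness visible:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the mapped function has actually executed.
    static final AtomicInteger calls = new AtomicInteger();

    // Builds the pipeline but does not run it: map only records the function,
    // much as an RDD transformation only records the closure.
    static Stream<Integer> build(int n) {
        return IntStream.rangeClosed(1, n).boxed()
            .map(x -> { calls.incrementAndGet(); return x * x; });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = build(5);
        // Still zero: defining the "transformation" touched no data.
        System.out.println("calls after map: " + calls.get());

        // The terminal operation (the "action") drives the pipeline once.
        int sum = pipeline.mapToInt(Integer::intValue).sum();
        System.out.println("sum = " + sum + ", calls = " + calls.get()); // 55, 5
    }
}
```

Spark's DAG optimizer goes further than Streams (stage fusion across a cluster, partition-aware scheduling), but the deferred-execution contract is the same shape.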



On 06-Dec-2014 6:12 pm, "Ron Ayoub" <> wrote:

This is from a separate thread with a different title.

Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item on each iteration of the clustering algorithm. The algorithm is massive and iterates thousands of times. As I understand it now, you are supposed to create new RDDs on each pass. This is a hierarchical k-means that I'm doing, and hence it consists of many small iterations rather than a few large ones.
So I understand the restriction on why operations need to be associative when aggregating, reducing, etc. However, forEach operates on a single item. Given that Spark is advertised as great for iterative algorithms because it operates in-memory, how can it be good to create thousands upon thousands of RDDs over the course of an iterative algorithm? Does Spark have some kind of trick behind the scenes, like reuse or fully persistent data structures? How can it possibly be efficient for 'iterative' algorithms when it is creating so many RDDs as opposed to one?

Or is the answer that I should keep doing what I'm doing because it works, even though it is not theoretically sound or aligned with functional ideas? I personally just want it to be fast and able to operate on up to 500 million data items.
