spark-user mailing list archives

From Ron Ayoub <>
Subject RE: Java RDD Union
Date Sat, 06 Dec 2014 12:48:08 GMT
Hierarchical k-means requires a massive number of iterations whereas flat k-means does not, but I've found flat clustering to be generally useless, since in most UIs it is nice to be able to drill down into more and more specific clusters. If you have 100 million documents and your branching factor is 8 (8-secting k-means), then you will be picking a cluster to split and iterating thousands of times. So per split you iterate maybe 6 or 7 times to get new cluster assignments, and there will ultimately be 5,000 to 50,000 splits depending on the split criterion, cluster variances, etc.
In this case fault tolerance doesn't matter. I've found that the distributed aspect of RDDs is what I'm looking for; I don't care about or need the resilience part as much. It is a one-off algorithm that can simply be run again if something goes wrong. Once the data is created, it is done with Spark.
But anyway, that is the very thing Spark is advertised for. 
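The per-iteration pattern under discussion can be sketched locally. Below is a plain-Java toy (1-D points, hypothetical class name KMeansSketch), not actual Spark code: each assign() call builds a fresh assignment array rather than mutating state in place, mirroring how a Spark map() over an immutable RDD produces a new RDD per iteration.

```java
import java.util.Arrays;

public class KMeansSketch {
    // One assignment pass: map each point to its nearest centroid.
    // A fresh array is built on every call, just as Spark's map()
    // yields a new immutable RDD instead of updating in place.
    static int[] assign(double[] points, double[] centroids) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(points[i] - centroids[c])
                        < Math.abs(points[i] - centroids[best])) {
                    best = c;
                }
            }
            assignment[i] = best;
        }
        return assignment;
    }

    // Recompute each centroid as the mean of its assigned points.
    static double[] update(double[] points, int[] assignment, int k) {
        double[] sum = new double[k];
        int[] count = new int[k];
        for (int i = 0; i < points.length; i++) {
            sum[assignment[i]] += points[i];
            count[assignment[i]]++;
        }
        double[] centroids = new double[k];
        for (int c = 0; c < k; c++) {
            centroids[c] = count[c] > 0 ? sum[c] / count[c] : 0.0;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centroids = {0.0, 5.0}; // initial guesses
        int[] assignment = new int[points.length];
        // A handful of iterations typically suffices per split.
        for (int iter = 0; iter < 10; iter++) {
            assignment = assign(points, centroids);
            centroids = update(points, assignment, centroids.length);
        }
        System.out.println(Arrays.toString(assignment)); // [0, 0, 0, 1, 1, 1]
        System.out.println(Arrays.toString(centroids));  // [1.5, 10.5]
    }
}
```

In Spark the equivalent of assign() would be a map over the points RDD against broadcast centroids; each iteration yields a new (cheap) RDD, and only persisted ones hold resources, which is the point made in the reply below.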

> From:
> Date: Sat, 6 Dec 2014 06:39:10 -0600
> Subject: Re: Java RDD Union
> To:
> CC:
> I guess a major problem with this is that you lose fault tolerance.
> You have no way of recreating the local state of a mutable RDD if a
> partition is lost.
> Why would you need thousands of RDDs for k-means? It's a few per iteration.
> An RDD is more bookkeeping than data structure, in itself. They don't
> inherently take up resources unless you mark them to be persisted.
> You're paying the cost of copying objects to create one RDD from the next,
> but that's mostly it.
> On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub <> wrote:
> > With that said, and given the nature of the iterative algorithms that Spark is
> > advertised for, isn't this a bit of an unnecessary restriction, since I don't
> > see where the problem is? For instance, it is clear that when aggregating
> > you need operations to be associative because of the way they are divided
> > and combined. But since forEach works on an individual item, the same problem
> > doesn't exist.
> >
> > As an example, during a k-means algorithm you have to continually update
> > cluster assignments per data item, along with perhaps distance from centroid.
> > So if you can't update items in place, you have to literally create thousands
> > upon thousands of RDDs. Does Spark have some kind of trick behind the
> > scenes, like reuse or fully persistent data structures? How can it possibly
> > be efficient for 'iterative' algorithms when it is creating so many RDDs as
> > opposed to one?