spark-user mailing list archives

From derrickburns <derrickrbu...@gmail.com>
Subject Re: KMeans with large clusters Java Heap Space
Date Fri, 30 Jan 2015 10:11:11 GMT
By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a million-dimensional space. The current Spark clusterer turns each
sparse vector into a dense vector with a million entries when it is added to a
cluster. Hence, the memory needed grows as the number of clusters times 8 MB
(2^20 entries at 8 bytes per double)....
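That arithmetic can be sketched in plain Scala (a hedged back-of-the-envelope estimate; `centerBytes` is a hypothetical helper for illustration, not part of Spark or the package below):

```scala
object CenterMemory {
  // Default HashingTF feature-space size: 2^20 dimensions
  val dims: Long = 1L << 20
  val bytesPerDouble: Long = 8L

  // Approximate heap needed to hold k dense cluster centers
  def centerBytes(k: Int): Long = k.toLong * dims * bytesPerDouble

  def main(args: Array[String]): Unit = {
    // 1000 dense centers of 2^20 doubles each, before any JVM overhead
    println(s"1000 clusters: ${centerBytes(1000)} bytes")
  }
}
```

Even a modest number of clusters can exhaust a default-sized heap this way, which is consistent with the Java heap space error in the subject line.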

You should try my new generalized kmeans clustering package
<https://github.com/derrickburns/generalized-kmeans-clustering>, which
works on high-dimensional sparse data.

You will want to use the RandomIndexing embedding:

def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
  KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
}



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-tp21432p21437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

