spark-user mailing list archives

From derrickburns <>
Subject Re: KMeans with large clusters Java Heap Space
Date Fri, 30 Jan 2015 10:11:11 GMT
By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a roughly million-dimensional space. The current Spark clusterer turns
each sparse vector into a dense vector with 2^20 entries when it is added to
a cluster. Hence, the memory needed grows as the number of clusters times
about 8 MB (2^20 doubles at 8 bytes each)....
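To see why this exhausts the heap, here is a back-of-the-envelope estimate; the cluster count below is a hypothetical example, not a figure from the original post:

```scala
// Rough memory estimate for dense cluster centers (my arithmetic,
// not output from Spark itself).
// HashingTF's default dimensionality is 2^20, and a dense center
// stores one 8-byte double per dimension.
val dims = 1 << 20                     // 1,048,576 features
val bytesPerCenter = dims * 8L         // 8 MiB per dense cluster center
val k = 1000                           // hypothetical cluster count
val totalMiB = k * bytesPerCenter / (1024L * 1024L)
println(s"$k dense centers need about $totalMiB MiB")  // ~8000 MiB
```

So at a thousand clusters the centers alone need on the order of 8 GB, before counting any of the data itself.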

You should try my new generalized kmeans clustering package
<>, which
works on high-dimensional sparse data.

You will want to use the RandomIndexing embedding:

def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
  KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
}
