spark-user mailing list archives

From Rajani Maski <rajani.ma...@gmail.com>
Subject Spark ML : k-means producing skewed cluster sizes
Date Thu, 28 Sep 2017 20:47:28 GMT
Dear Spark-User Group,

Short question

     Spark k-means is consistently producing highly skewed cluster size
distributions in my experiments.  The majority of the data-points are
assigned to one cluster.  Has anyone else experienced this behavior?

Longer version

     I am experimenting with the Spark k-means library and have been
observing highly skewed cluster size distributions. For example, in an
experiment with about 10,000 data-points and k=20, close to 95% of the
data-points are assigned to one cluster, and the remaining < 5% of the
data-points are assigned to the other 19 clusters.  See Figure 1 below.
This experiment was conducted using the 20 Newsgroups data set, for which
ground truth is available: the ~10K data-points were manually categorized
into 20 fairly balanced groups. http://qwone.com/~jason/20Newsgroups/

Initially I suspected that the vector creation step (using Spark's
HashingTF and IDF libraries) was the cause of the incorrect clustering.
However, even after implementing my own version of the TF-IDF based vector
representation, I still got similar clustering results with a highly skewed
size distribution.
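
For reference, a hand-rolled TF-IDF step of the kind described above could
look roughly like the sketch below. It is not the code used in the
experiment: the function name tfidf_vectors, the toy documents, and the
sparse dict representation are all illustrative assumptions. The IDF term
uses the smoothed form log((N + 1) / (df + 1)), which is the formula given
in the Spark MLlib feature extraction documentation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Hypothetical helper: docs is a list of token lists.

    Returns (idf, vectors), where idf maps term -> IDF weight and
    vectors is a list of sparse {term: tf * idf} dicts, one per doc.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (count + 1))
           for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return idf, vectors

# Toy corpus, made up for illustration only.
docs = [
    ["spark", "kmeans", "cluster"],
    ["spark", "tfidf", "vector"],
    ["cosine", "distance", "cluster"],
]
idf, vectors = tfidf_vectors(docs)
```

Note that the resulting vectors are not length-normalized, so longer
documents produce vectors with larger norms.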

Eventually I implemented my own version of k-means, which uses the TF-IDF
vector representation and negative cosine similarity as the distance
metric.  The results from this k-means look right.  See Figure 2 below.
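
To make the approach concrete, here is a minimal sketch of k-means driven
by cosine similarity (often called spherical k-means). It is an
illustration under assumptions, not the implementation that produced
Figure 2: the names normalize and cosine_kmeans, the random
initialization, and the fixed iteration count are all hypothetical
choices, and it works on small dense lists rather than sparse TF-IDF
vectors.

```python
import math
import random

def normalize(v):
    """Scale v to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_kmeans(points, k, iterations=20, seed=0):
    """Hypothetical sketch: cluster points by cosine similarity."""
    rng = random.Random(seed)
    unit = [normalize(p) for p in points]
    # Initialize centers with k distinct input points (an assumption;
    # other schemes such as k-means++ could be used instead).
    centers = [list(c) for c in rng.sample(unit, k)]
    assignments = [0] * len(unit)
    for _ in range(iterations):
        # Assignment step: on unit vectors the dot product is the cosine
        # similarity, so pick the center with the largest dot product.
        for i, p in enumerate(unit):
            assignments[i] = max(
                range(k),
                key=lambda j: sum(a * b for a, b in zip(p, centers[j])),
            )
        # Update step: average the members, then re-normalize the center.
        for j in range(k):
            members = [unit[i] for i, a in enumerate(assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = normalize(mean)
    return assignments, centers

# Toy data: two directions with very different vector lengths.
points = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
assignments, centers = cosine_kmeans(points, k=2)
```

Because every point is scaled to unit length first, only the direction of
a vector matters, so document length does not influence the assignment.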

So, the question is: why does Spark k-means perform the way it does?  Is it
that the similarity metric (Euclidean distance) is not appropriate for text
data?
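
One fact that bears on this question: for unit-length vectors, squared
Euclidean distance and cosine similarity are tied together by
||a - b||^2 = 2 - 2 * cos(a, b). The sketch below checks that identity
numerically on two made-up vectors; the helper names are illustrative
assumptions. It suggests that L2-normalizing the TF-IDF vectors (for
example with Spark's Normalizer transformer) before running Euclidean
k-means may be worth trying, though this only approximates cosine-based
k-means, because the cluster centers are means of unit vectors and are
generally not unit length themselves.

```python
import math

def normalize(v):
    """Scale v to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary vectors with different lengths, made up for illustration.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Left side: squared distance between the normalized vectors.
lhs = squared_euclidean(normalize(a), normalize(b))
# Right side: 2 - 2 * cosine similarity of the original vectors.
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
```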

Any thoughts and pointers will be highly appreciated.

[image: Inline image 1]
Thanks & Regards,
Rajani
