spark-user mailing list archives

From Alessandro Solimando
Subject Re: K Means Clustering Explanation
Date Fri, 02 Mar 2018 08:37:30 GMT
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).

For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., a Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on
this topic, although I have no pointers to provide).
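
To make the idea concrete, here is a minimal sketch in plain Python (standard
library only, not Spark). It is not a full decision tree: for one cluster at a
time it searches for the single "feature <= threshold" or "feature > threshold"
rule that best separates that cluster from the rest (a one-vs-rest decision
stump). The feature names, the data points and the cluster ids are all made up
for illustration. In Spark ML, the closest equivalent would presumably be to
train a DecisionTreeClassifier on the cluster ids and read the rules from the
model's toDebugString.

```python
from typing import List, Sequence, Tuple

def best_stump(points: Sequence[Sequence[float]],
               labels: Sequence[int],
               target: int) -> Tuple[float, int, float, str]:
    """Return (accuracy, feature_index, threshold, op) for the single
    one-feature rule that best separates cluster `target` from the rest."""
    is_target = [lab == target for lab in labels]
    best = (-1.0, 0, 0.0, "<=")
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            for op in ("<=", ">"):
                correct = sum(
                    ((p[f] <= thr) if op == "<=" else (p[f] > thr)) == t
                    for p, t in zip(points, is_target)
                )
                acc = correct / len(points)
                if acc > best[0]:  # keep the first rule reaching the best accuracy
                    best = (acc, f, thr, op)
    return best

# Hypothetical data: two numeric features and the cluster id of each point
# (in practice the ids would come from the trained K-Means model).
feature_names: List[str] = ["f0", "f1"]
points = [(1.0, 10.0), (1.5, 11.0),    # cluster 0
          (5.0, 10.5), (5.5, 9.5),     # cluster 1
          (1.25, 30.0), (0.75, 31.0)]  # cluster 2
labels = [0, 0, 1, 1, 2, 2]

for cluster in sorted(set(labels)):
    acc, f, thr, op = best_stump(points, labels, cluster)
    print(f"cluster {cluster}: {feature_names[f]} {op} {thr} (accuracy {acc:.2f})")
```

On this toy data, clusters 1 and 2 are each fully described by one rule, while
cluster 0 is not, because it needs a condition on both features. That is the
point at which a deeper tree (and the verbosity mentioned above) comes in.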


On 1 March 2018 at 21:59, Christoph Brücke wrote:

> Hi Matt,
> I see. You could use the trained model to predict the cluster id for each
> training point. Now you should be able to create a dataset with your
> original input data and the associated cluster id for each data point in
> the input data. Now you can group this dataset by cluster id and aggregate
> over the original 5 features. E.g., get the mean for numerical data or the
> value that occurs the most for categorical data.
> The exact aggregation is use-case dependent.
> I hope this helps,
> Christoph
> On 01.03.2018 at 21:40, "Matt Hicks" wrote:
> Thanks for the response Christoph.
> I'm converting large amounts of data into clustering training data, and I'm
> just having a hard time reasoning about how to map the clusters (in code)
> back to the original format so I can properly understand the dominant values
> in each cluster.
> For example, if I have five fields of data and I've trained ten clusters
> of data I'd like to output the five fields of data as represented by each
> of the ten clusters.
> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke wrote:
>> Hi Matt,
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>> What I typically do is convert the cluster centers back to the
>> original input format or, if that is not possible, use the point nearest to
>> the cluster center as a representation of the whole cluster.
>> Can you be a little bit more specific about your use-case?
>> Best,
>> Christoph
>> On 01.03.2018 at 20:53, "Matt Hicks" wrote:
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine from the clusters which
>> distinguishing characteristics define each cluster, so I can explain the
>> "reasons" data fits into a specific cluster.
>> Is there a proper way to do this in Spark ML?
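
As a rough illustration of the approach Christoph describes in this thread
(attach a cluster id to every input row, group by that id, then aggregate each
original field), here is a minimal sketch in plain Python using only the
standard library. The field names, rows and cluster ids are invented for the
example, and summarize_clusters is a hypothetical helper, not a Spark API. In
Spark ML the ids would normally come from the trained model's transform(),
which adds a prediction column that can then be used with groupBy and agg.

```python
from collections import Counter, defaultdict
from statistics import mean

def summarize_clusters(rows, cluster_ids, numeric_fields, categorical_fields):
    """Group rows by cluster id; report the mean of each numeric field and
    the most frequent value of each categorical field."""
    groups = defaultdict(list)
    for row, cid in zip(rows, cluster_ids):
        groups[cid].append(row)

    summary = {}
    for cid, members in sorted(groups.items()):
        profile = {"size": len(members)}
        for field in numeric_fields:
            profile[field] = mean(r[field] for r in members)
        for field in categorical_fields:
            counts = Counter(r[field] for r in members)
            profile[field] = counts.most_common(1)[0][0]
        summary[cid] = profile
    return summary

# Hypothetical input: original rows plus the cluster id predicted for each row.
rows = [
    {"age": 30.0, "income": 40000.0, "plan": "basic"},
    {"age": 40.0, "income": 50000.0, "plan": "basic"},
    {"age": 60.0, "income": 90000.0, "plan": "premium"},
    {"age": 62.0, "income": 95000.0, "plan": "premium"},
    {"age": 64.0, "income": 85000.0, "plan": "basic"},
]
cluster_ids = [0, 0, 1, 1, 1]

summary = summarize_clusters(rows, cluster_ids,
                             numeric_fields=["age", "income"],
                             categorical_fields=["plan"])
for cid, profile in summary.items():
    print(cid, profile)
```

Which aggregate to use per field (mean, median, most frequent value, and so on)
depends on the use case, as noted in the thread.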
