lucene-solr-user mailing list archives

From Stanislaw Osinski <stanis...@osinski.name>
Subject Re: News clustering
Date Mon, 03 Dec 2012 17:37:08 GMT
> I mean measuring the similarity between the document in each cluster.
> Also, difference between document on one cluster with another cluster.
>
> I saw the sample code ClusteringQualityBencmark.java
> However, I do not know how to make use of it for assessing my Solr
> Clustering performance.
>

You'd need to write your own code for this. Here are the most common
clustering quality measures of the kind you mentioned:

http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

These are meant for the general case (numeric attributes). To apply them to
texts, you'd need to use the vector representation of the documents (for
example, term-frequency or TF-IDF vectors).
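
As a rough illustration of what such code could look like, here is a minimal
Python sketch that measures similarity within a cluster and between two
clusters. It assumes plain term-frequency vectors, whitespace tokenization and
cosine similarity. All function names and the sample documents are made up for
the example; none of this is part of Solr or Carrot2.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def intra_cluster_similarity(cluster):
    """Average pairwise similarity of the documents inside one cluster.

    Returns 0.0 for a cluster with fewer than two documents.
    """
    return mean(cosine(a, b) for a, b in combinations(cluster, 2))


def inter_cluster_similarity(cluster_a, cluster_b):
    """Average similarity over all document pairs drawn from two clusters."""
    return mean(cosine(a, b) for a in cluster_a for b in cluster_b)


# Hypothetical clustering output: label -> list of document texts.
raw_clusters = {
    "sports": ["team wins football match", "football team loses match"],
    "finance": ["stock market falls", "market rally lifts stock prices"],
}
clusters = {
    label: [tf_vector(doc) for doc in docs]
    for label, docs in raw_clusters.items()
}

for label, vectors in clusters.items():
    print(label, "intra:", intra_cluster_similarity(vectors))
print("sports vs finance:",
      inter_cluster_similarity(clusters["sports"], clusters["finance"]))
```

With real Solr output you would replace raw_clusters with the documents of
each returned cluster. A good clustering should show high intra-cluster values
and low inter-cluster values.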

On a more general note, synthetic measures test only the document-cluster
assignments, but none take the quality of labels into account (this is
really hard to measure objectively).

Staszek
