mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Clustering raw articles vs clustering (Stanford's) NER output
Date Mon, 12 May 2014 20:58:29 GMT

Clustering with higher level data available for the distance computation is a fine thing.

The tuning will be very different, but the results can be very good when the named entity resolver
gets a good hit.  Since named entities tend to be relatively rare, they get high IDF scores,
and other terms recede a bit as a result of normalization.
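
To make the IDF point concrete, here is a minimal sketch in plain Python. It is not Mahout's
weighting code: the toy documents, the ENTITY_ token prefix, and the smoothed formula
idf = ln(N/df) + 1 are all assumptions chosen for illustration. It only shows that a token
appearing in one document ends up with a larger share of the unit-length vector than terms
that appear everywhere.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors (dicts) for tokenized documents.

    Uses idf = ln(N / df) + 1, a common smoothed variant picked for
    illustration; Mahout's own weighting may differ.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2 normalization: a high-IDF term takes a larger share of the
        # unit-length vector, so the other terms shrink in proportion.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Hypothetical toy corpus; the ENTITY_ token stands in for NER output.
docs = [
    ["election", "vote", "report", "ENTITY_Angela_Merkel"],
    ["election", "vote", "report", "poll"],
    ["market", "report", "vote", "shares"],
    ["market", "report", "election", "shares"],
]

vecs = tfidf_vectors(docs)
for term, weight in sorted(vecs[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:24s} {weight:.3f}")
```

In the first document the entity token gets the largest weight and "report", which occurs in
every document, gets the smallest, so a cosine-style distance is dominated by shared entities
when they are present.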


> On May 12, 2014, at 6:29, David Noel <> wrote:
> I've spent a few weeks tuning Mahout to cluster news articles and have
> had decent results. Decent, but still not perfect. In trying to think
> of ways to improve my results I had the idea of running Mahout on
> output from Stanford's Named Entity Recognizer (NER) instead of the
> articles themselves, and seeing how that compared. Has anyone tried
> this? Did it generate more cohesive clusters?
