mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Clustering a large crawl
Date Mon, 04 Jun 2012 20:58:14 GMT
I do have rowsimilarity calculated for each doc using the same measure
as clustering, and that produces pretty good results as far as the
eyeball takes you. I assume this is what you mean by using the doc as a
query.
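As a plain-Python illustration of that kind of doc-as-query eyeball check (the toy documents and term weights below are invented, and dict-based sparse vectors stand in for Mahout's Vector classes):

```python
# Sketch of a metric sanity check: treat one document's TF-IDF vector as
# a query, rank every other document by cosine similarity, and eyeball
# whether the top hits really are similar. Toy data, not real crawl output.
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors ({term: weight} dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query_id, docs, k=3):
    """Rank all other docs by cosine similarity to the query doc."""
    q = docs[query_id]
    scored = [(cosine(q, v), d) for d, v in docs.items() if d != query_id]
    return sorted(scored, reverse=True)[:k]

docs = {
    "a": {"hadoop": 0.9, "cluster": 0.4},
    "b": {"hadoop": 0.8, "cluster": 0.5, "mahout": 0.2},
    "c": {"recipe": 0.7, "cake": 0.6},
}
print(nearest("a", docs))  # doc "b" should rank far above doc "c"
```

If the top-ranked documents look right for a handful of queries, the metric is probably usable for clustering, which is exactly the test described below.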

On 6/4/12 9:14 AM, Ted Dunning wrote:
> Even having millions of dimensions isn't all that bad if that induces a
> reasonable distance between documents.  The easy way to test that is to use
> several document vectors as queries and see whether the closest other
> documents appear to you to be very similar.  If this is true for a number
> of documents, you should be good to go with whatever metric you are using.
> For fast clustering, you may need a low-dimensional surrogate metric so
> that you can get higher throughput, but the point of the low-dimensional
> surrogate is that it *replicates* the behavior of the metric that you
> really want.  It isn't going to make your metric better.
> On Mon, Jun 4, 2012 at 5:15 PM, Pat Ferrel<>  wrote:
>> After looking again at the dictionary for 150,000 web pages I have 259,000
>> dimensions! Part of the problem is I can't get Tika to detect language very
>> well (working on this) so I get groups of non-english pages that throw in
>> quite a few new terms. Overall I think some form of dimensional reduction
>> is called for, no?
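Mahout's own Lanczos SVD or stochastic SVD jobs are the usual route for that reduction; purely to illustrate the low-dimensional-surrogate idea, here is a minimal random-projection sketch in plain Python (term indices, weights, output dimension, and the seed-mixing constant are all made up for illustration):

```python
# Sketch of sparse random projection: map a ~259,000-dimension sparse
# vector to a small dense one with a pseudo-random +/-1 matrix. By the
# Johnson-Lindenstrauss lemma, distances are approximately preserved,
# so the projected space can serve as a faster surrogate metric.
import math
import random

def project(vec, dims_out, seed=42):
    """Project a sparse {term_index: weight} vector into dims_out dense
    dimensions. Each term's matrix row is regenerated on the fly from a
    seeded RNG, so the full projection matrix is never materialised."""
    out = [0.0] * dims_out
    for term, weight in vec.items():
        rng = random.Random(seed * 1_000_003 + term)  # reproducible row
        for j in range(dims_out):
            out[j] += weight * (1.0 if rng.random() < 0.5 else -1.0)
    return [x / math.sqrt(dims_out) for x in out]

doc = {10: 0.9, 42: 0.4, 99: 0.1}
print(project(doc, 8))
```

The projection is deterministic for a fixed seed, so all documents share one implicit matrix; unlike SVD it does nothing to suppress the noise terms from the non-English pages, it only makes the metric cheaper to evaluate.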
