mahout-user mailing list archives

From: Ted Dunning
Subject: Re: Cluster text docs
Date: Sat, 19 Dec 2009 17:27:49 GMT

I think you are making a very big (and very wrong) assumption here.

The non-grammaticality of these chunks does not generally adversely affect
topic identification and can actually help it quite a bit.

It is important to avoid "everybody knows" facts in your development at this
point.  Even if everybody you talk to agrees that you don't even need to
look at the data on this topic, you should still be suspicious of strong
statements without data.

On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange wrote:

> In particular, I have a question about building n-grams (subsets) from
> noun-chunks. In the power-sets of noun-chunks, we don't want to have
> subsets like "world's first". That would surely spoil the clustering.
> Every subset should include the grammatical core of the chunk, in this
> example, "aircraft".
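
One way to get that data is to generate both candidate sets and cluster with
each. A minimal sketch, assuming the chunk in question is "world's first
aircraft" (the quoted text names only the fragment and the head), that n-grams
are contiguous token spans rather than arbitrary subsets, and that the head
position is supplied by hand. The function names are hypothetical and are not
part of Mahout:

```python
from typing import List


def contiguous_ngrams(tokens: List[str]) -> List[str]:
    """Return every contiguous n-gram of the chunk, shortest first."""
    n = len(tokens)
    return [
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(n - size + 1)
    ]


def head_ngrams(tokens: List[str], head_index: int) -> List[str]:
    """Return only the contiguous n-grams that span the head token."""
    n = len(tokens)
    return [
        " ".join(tokens[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
        if i <= head_index < j
    ]


# Assumed example chunk; the head ("aircraft") is marked by hand at index 2.
chunk = ["world's", "first", "aircraft"]

all_candidates = contiguous_ngrams(chunk)       # includes "world's first"
head_only = head_ngrams(chunk, head_index=2)    # every entry contains "aircraft"
```

Feeding each list to the same clustering run and comparing the results would
show whether fragments such as "world's first" help or hurt.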

Ted Dunning, CTO
