mahout-user mailing list archives

From Felix Lange <fxla...@googlemail.com>
Subject Re: Cluster text docs
Date Mon, 21 Dec 2009 08:32:43 GMT
Hi,
Ted, I agree, sentences don't need to be grammatical for our purposes. My
intention was just to cut out noun-less phrases like "very good". I just
think that, in general, nouns say more about a topic than adjectives do, so
I can leave such phrases aside and make the feature vector a bit smaller.
@Drew: Yes, we actually did some testing on unigrams, and the results
weren't that bad.
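
To make the idea concrete, here is a minimal sketch of generating only those
sub-phrases of a noun chunk that contain its head noun. It assumes the chunk
is already tokenized and that the position of the head is known (here simply
taken to be the last token, which is only a rough heuristic for English noun
chunks). The function name head_ngrams and its parameters are made up for
illustration and are not part of Mahout.

```python
def head_ngrams(tokens, head_index):
    """Return all contiguous n-grams of `tokens` that contain the head token.

    `tokens` is one noun chunk as a list of words; `head_index` is the
    position of its grammatical core (an assumption supplied by the caller).
    """
    if not 0 <= head_index < len(tokens):
        raise ValueError("head_index out of range")
    ngrams = []
    # Every valid n-gram starts at or before the head and ends at or after it.
    for start in range(head_index + 1):
        for end in range(head_index + 1, len(tokens) + 1):
            ngrams.append(" ".join(tokens[start:end]))
    return ngrams


chunk = ["world's", "first", "aircraft"]
# Assumption: the head noun is the last token of the chunk.
result = head_ngrams(chunk, head_index=len(chunk) - 1)
print(result)
```

With this filter, "world's first" is never produced, while "aircraft",
"first aircraft" and the full chunk are. Whether dropping the noun-less
subsets actually helps the clustering is something that would still have to
be checked against the data.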

Greetings
Felix



2009/12/19 Ted Dunning <ted.dunning@gmail.com>

> I think you are making a very big (and very wrong) assumption here.
>
> The non-grammaticality of these chunks does not generally adversely affect
> topic identification and can actually help it quite a bit.
>
> It is important to avoid "everybody knows" facts in your development at this
> point.  Even if everybody you talk to agrees that you don't even need to
> look at the data on this topic, you should still be suspicious of strong
> statements without data.
>
> On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <fxlange@googlemail.com>
> wrote:
>
> > In particular, I have a question about building n-grams (subsets) from
> > noun-chunks. In the power-sets of noun-chunks, we don't want to have
> > subsets like "world's first". That would surely spoil the clustering.
> > Every subset should include the grammatical core of the chunk, in this
> > example, "aircraft".
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
