mahout-user mailing list archives

From Felix Lange <>
Subject Re: Cluster text docs
Date Sat, 19 Dec 2009 16:16:17 GMT
Hi there,

I would like to add some thoughts about feature selection to this thread.
I'm working on the topic-clustering project at the TU Berlin, which has
already been discussed on this mailing list.

Choosing the right feature-extraction and clustering algorithms is one part
of the story, but what should be the input to these algorithms in the first
place?
In a thread about preprocessing, my colleague Marc presented our UIMA-based
pipeline. To sum this up, our pipeline implements the following
preprocessing steps: stripping of HTML tags > POS tagging and noun-group
chunking, both via wrappers for LingPipe annotators > stemming > stopword
filtering. So we could actually pass stemmed words without stopwords to the
feature extractor, but there are more effective (and probably less
data-intensive) possibilities.
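To make the order of those steps concrete, here is a minimal sketch in
plain Java. The class, the regex-based stubs, and the toy stopword list are
illustrative assumptions only; the real chain consists of UIMA analysis
engines wrapping LingPipe annotators.

import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class PreprocessingPipeline {
    private final List<UnaryOperator<String>> steps;

    @SafeVarargs
    public PreprocessingPipeline(UnaryOperator<String>... steps) {
        this.steps = Arrays.asList(steps);
    }

    public String run(String document) {
        String current = document;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current); // each step rewrites the text
        }
        return current;
    }

    public static void main(String[] args) {
        // Stubs stand in for the real annotators; only the HTML stripper
        // and the (toy) stopword filter do any work here.
        PreprocessingPipeline pipeline = new PreprocessingPipeline(
            text -> text.replaceAll("<[^>]+>", " "),  // strip HTML tags
            text -> text,                             // POS tagging + noun chunking (stub)
            text -> text,                             // stemming (stub)
            text -> text.replaceAll("(?i)\\b(the|a|an|of|is|as|with)\\b", " "));
        System.out.println(pipeline.run(
            "<p>The Me 262 is well known as the world's first fighter " +
            "aircraft with a jet engine.</p>").replaceAll("\\s+", " ").trim());
    }
}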
Think about a sentence like this one:

(1) "The Me 262 is well known as the world's first fighter aircraft with a
jet engine."

If you do topic-clustering, which words give a proper representation of this
sentence's topic? A good guess seems to be to take the noun phrases, i.e.
"The Me 262", "the world's first fighter aircraft", and "a jet engine". Our
noun chunker can easily achieve this, if we include number words (262) in
the set of grammatical categories occurring inside a noun phrase. But if we
stop here, we miss a generalization: a text with a chunk "fighter aircrafts"
probably has the same topic. But if we pass such chunks over as atomic
features, we end up without a match, because that chunk is not
string-identical to "the world's first fighter aircraft". To make the
feature-extractor/clusterer recognize the similarity, we do the following:
stemming (strips off the "s"), excluding determiners ("the") inside chunks,
and building from every chunk the set of sub-chunks that reflects its
grammatical structure (a sketch of this expansion follows below). For "the
world's first fighter aircraft", we end up with the set {"world's first
fighter aircraft", "first fighter aircraft", "fighter aircraft",
"aircraft"}, thus detecting the similarity to the chunk "fighter aircrafts"
(after stemming, that is).

One could argue: why take complete noun chunks in the first place, when they
cannot be easily matched with other phrases? This is because noun groups can
carry meanings that cannot be calculated from their parts. For example, a
chunk "bag of words" offers an excellent guess as to what an article is
about (namely, text processing). But that is not clear if you only look at
the single words "bag", "of", and "words".
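Here is a minimal sketch of the expansion step, under my reading that the
set from the example consists of the contiguous suffixes of the
determiner-free chunk, each ending in the rightmost token. Class and method
names are hypothetical:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkExpander {

    // Expands a determiner-free, stemmed noun chunk into all contiguous
    // sub-chunks that end in the rightmost token, which for simple English
    // noun groups is the grammatical core.
    static List<String> expand(List<String> chunkTokens) {
        List<String> subChunks = new ArrayList<>();
        for (int start = 0; start < chunkTokens.size(); start++) {
            subChunks.add(String.join(" ",
                chunkTokens.subList(start, chunkTokens.size())));
        }
        return subChunks;
    }

    public static void main(String[] args) {
        // "the" has already been excluded by the determiner filter.
        System.out.println(expand(
            Arrays.asList("world's", "first", "fighter", "aircraft")));
        // -> [world's first fighter aircraft, first fighter aircraft,
        //     fighter aircraft, aircraft]
    }
}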
As for the words that are not nouns or parts of noun chunks, many of them
can be left aside. For example, a word like "good" is not that specific when
it comes to topic clustering. "Good" is an adjective, "aircraft" is a noun,
so a selection of topic-specific words can be done on the basis of
grammatical categories. That's what we have the POS tagger for (a toy
example follows below).
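A toy version of that selection, assuming Penn-Treebank-style tags ("JJ" for
adjectives, "NN"/"NNS" for nouns); the actual tagset depends on the LingPipe
model we wrap:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PosFilter {

    // Keeps only tokens whose POS tag marks them as nouns, the
    // topic-bearing categories in our setting.
    static List<String> topicWords(List<String> tokens, List<String> tags) {
        return IntStream.range(0, tokens.size())
            .filter(i -> tags.get(i).startsWith("NN"))
            .mapToObj(tokens::get)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("good", "aircraft");
        List<String> tags = Arrays.asList("JJ", "NN");
        System.out.println(topicWords(tokens, tags)); // -> [aircraft]
    }
}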

Any comments on this approach are of course welcome. In particular, I have a
question about building n-grams (sub-chunks) from noun chunks. Among the
sub-chunks of a noun chunk, we don't want to have fragments like "world's
first". That would surely spoil the clustering. Every sub-chunk should
include the grammatical core of the chunk, in this example "aircraft".
LingPipe's noun chunker is not able to do this, because it is based on a
sequential parse of the POS tags. If you have a chunk "wizard of warcraft",
the core of the chunk is "wizard", appearing on the outer left of the chunk.
In order to detect it, we need a deep parser (once the head is known, the
generation step itself stays cheap, see the sketch below). But a deep parser
seems to be much more costly: on an off-the-shelf dual-core computer with
4 GB of memory, we can do the preprocessing of this e-mail within half a
second, and that would change dramatically if we used a deep parser. Or am I
wrong?
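To make the division of labour clear: assuming some parser has already told
us the index of the head, generating only the core-preserving sub-chunks is
trivial and cheap. A hypothetical sketch:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeadAwareExpander {

    // Generates all contiguous sub-chunks that contain the token at
    // headIndex. Finding headIndex is the hard (deep-parsing) part; this
    // sketch assumes it has been supplied.
    static List<String> expand(List<String> tokens, int headIndex) {
        List<String> subChunks = new ArrayList<>();
        for (int start = 0; start <= headIndex; start++) {
            for (int end = headIndex + 1; end <= tokens.size(); end++) {
                subChunks.add(String.join(" ", tokens.subList(start, end)));
            }
        }
        return subChunks;
    }

    public static void main(String[] args) {
        // Head on the outer left: every sub-chunk keeps "wizard", and
        // fragments like "of warcraft" are never generated.
        System.out.println(expand(Arrays.asList("wizard", "of", "warcraft"), 0));
        // -> [wizard, wizard of, wizard of warcraft]
    }
}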

