mahout-user mailing list archives

From Lance Norskog <>
Subject Re: Tags generation?
Date Sat, 04 Aug 2012 01:31:28 GMT
Thanks, everyone. I hadn't considered the stem/synonym problem. I have
code for regularizing a doc/term matrix with tf, binary, log and
augmented norm for the cells and idf, gfidf, entropy, normal (term
vector) and probabilistic inverse. Running any of these, and then SVD,
on a Reuters article may take 10-20 ms. This uses a sentence/term
matrix for document summarization. After doing all of this, I realized
that maybe just the regularized matrix was good enough.
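
A minimal sketch of one such combination (plain tf for the cells, log idf
as the global weight) on a sentence/term matrix. This is not the code
described above: the whitespace tokenizer and the names tfidf_matrix and
top_terms are assumptions made for illustration, and the SVD step is left out.

```python
import math
from collections import Counter

def tfidf_matrix(sentences):
    """Return (vocab, rows) where rows[i][j] is tf * idf for term j in sentence i."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # missing terms count as 0
        rows.append([tf[t] * idf[t] for t in vocab])
    return vocab, rows

def top_terms(vocab, rows, k=3):
    """Rank terms by total weight over all sentences (ties broken alphabetically)."""
    totals = [sum(col) for col in zip(*rows)]
    ranked = sorted(zip(vocab, totals), key=lambda p: (-p[1], p[0]))
    return [t for t, _ in ranked[:k]]
```

Terms that occur in every sentence get an idf of zero and drop out of the
ranking, which is the effect the global weight is meant to have.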

One thing came through: part-of-speech selection for nouns and verbs
helped 5-10% in every combination of regularizers. All across the
board. If you want good tags, select your parts of speech!

On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
<> wrote:
> I know, I know. :) Just wanted to mention that it could lead to funny
> results, that's all. There are lots of ways of doing proper form
> disambiguation, including shallow tagging, which then allows you to
> retrieve correct base forms for lemmas, not stems. Stemming is
> typically good enough (and fast), so your advice was 100% fine.
> Dawid
> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <> wrote:
>> This is definitely just the first step.  Similar goofs happen with
>> inappropriate stemming.  For instance, AIDS should not stem to aid.
>> A reasonable way to find and classify exceptional cases is to look at
>> cooccurrence statistics.  The contexts of original forms can be examined to
>> find cases where there is a clear semantic mismatch between the original
>> and the set of all forms that stem to the same form.
>> But just picking the most common form that is present in the document is a
>> pretty good step for all that it produces some oddities.  The results are
>> much better than showing a user the stemmed forms.
>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss <> wrote:
>>> > Unstemming is pretty simple.  Just build an unstemming dictionary based on
>>> > seeing what word forms have led to a stemmed form.  Include frequencies.
>>> This can lead to very funny (or not, depends how you look at it)
>>> mistakes when different lemmas stem to the same token. How frequent
>>> and important this phenomenon is varies from language to language (and
>>> can be calculated a priori).
>>> Dawid
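
The unstemming dictionary discussed in the quoted thread could be sketched
roughly as follows. Everything here is an assumption for illustration:
toy_stem is a hypothetical stand-in for a real stemmer, tokens are assumed
to be lowercased already, and the selection rule is the "most common form
present in the document" heuristic mentioned above.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_unstem_dict(tokens, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    forms = defaultdict(Counter)
    for tok in tokens:
        forms[stem(tok)][tok] += 1
    return forms

def unstem(stem_form, forms, doc_tokens):
    """Pick the most frequent surface form that also occurs in this document."""
    present = set(doc_tokens)
    seen = forms.get(stem_form, Counter())
    candidates = [(count, form) for form, count in seen.items() if form in present]
    if not candidates:
        return stem_form  # nothing better known; fall back to the stem itself
    # Highest corpus count wins; ties are broken alphabetically.
    return min(candidates, key=lambda c: (-c[0], c[1]))[1]
```

With a lowercased corpus, "aids" and "aid" share the stem "aid", so a
document that contains both would be tagged with whichever is more frequent
in the corpus. That is the kind of collision the thread warns about.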

Lance Norskog
