mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Tags generation?
Date Tue, 07 Aug 2012 12:54:54 GMT
Nice stuff.  And glad that Mahout was able to help!

On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY <samikc@gmail.com> wrote:

> Hi All,
>
> We have developed an auto tagging system for our micro-blogging platform.
> Here is what we have done:
>
> The purpose of the system was to find tags in an article automatically
> when someone posts a link on our micro-blogging site. The goal was to allow
> us to follow a tag instead of (or in addition to) a person. So we used some
> custom code on top of Mahout, UIMA, OpenNLP, etc.
>
> If you are interested in seeing how it works, take a look at:
> http://www.scoopspot.com/
>
> One more thing: we also created a robot that visits some well-known
> web sites such as ReadWriteWeb, Hacker News, and TechCrunch, fetches
> articles from them, and publishes those to our micro-blog. As we
> already have tag following, we get the information without any problem.
> That's very cool (to us at least). You can see the output of the robot at
> this location:
>
> http://news.scoopspot.com/
>
> I thought this might be an example of what Mahout can do, and it is related
> to this thread, so I felt like sharing it with you.
>
> Sorry if this looks off-topic.
>
> Regards,
> Samik
>
> On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog <goksron@gmail.com> wrote:
>
> > I used the OpenNLP part-of-speech tool to label all words as 'noun',
> > 'verb', etc. I removed all words that were not nouns or verbs. In my
> > use case, this is a total win. In other cases, maybe not: Twitter has
> > a quite varied non-grammar.
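
A minimal sketch of the noun/verb filter described above. It assumes the tokens have already been tagged by an external tagger and that the tags follow the Penn Treebank convention (NN* for nouns, VB* for verbs), which the standard English OpenNLP models use. The function name and the sample data are hypothetical, not taken from the thread.

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Keep only words whose tag marks a noun (NN*) or a verb (VB*).

    tagged_tokens is a list of (word, tag) pairs produced by some
    part-of-speech tagger; Penn Treebank tags are assumed here.
    """
    return [word for word, tag in tagged_tokens
            if tag.startswith(("NN", "VB"))]


# Hypothetical tagger output for one short sentence.
tagged = [
    ("Mahout", "NNP"), ("quickly", "RB"), ("clusters", "VBZ"),
    ("the", "DT"), ("large", "JJ"), ("documents", "NNS"),
]
candidate_tags = keep_nouns_and_verbs(tagged)
```

The filter shrinks the vocabulary before any term weighting takes place, so it is cheap to switch on and off when comparing results.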
> >
> > On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <pat@farfetchers.com> wrote:
> > > The way back from stem to tag is interesting from the standpoint of
> > > making tags human readable. I had assumed a lookup but this seems much
> > > more satisfying and flexible. In order to keep frequencies it will take
> > > something like a dictionary creation step in the analyzer. This in turn
> > > seems to imply a join so a whole new map reduce job--maybe not completely
> > > trivial?
> > >
> > > It seems that NLP can be used in two very different ways here: first as
> > > a filter (keep only nouns and verbs?), second to differentiate semantics
> > > (can:verb, can:noun). One method is a dimensionality reduction technique;
> > > the other increases dimensions but can lead to orthogonal dimensions from
> > > the same term. I suppose both could be used together, as the above example
> > > indicates.
> > >
> > > It sounds like you are using it to filter (only?). Can you explain what
> > > you mean by:
> > > "One thing came through- parts-of-speech selection for nouns and verbs
> > > helped 5-10% in every combination of regularizers."
> > >
> > >
> > > On Aug 3, 2012, at 6:31 PM, Lance Norskog <goksron@gmail.com> wrote:
> > >
> > > Thanks everyone- I hadn't considered the stem/synonym problem. I have
> > > code for regularizing a doc/term matrix with tf, binary, log and
> > > augmented norm for the cells and idf, gfidf, entropy, normal (term
> > > vector) and probabilistic inverse. Running any of these, and then SVD,
> > > on a Reuters article may take 10-20 ms. This uses a sentence/term
> > > matrix for document summarization. After doing all of this, I realized
> > > that maybe just the regularized matrix was good enough.
> > >
> > > One thing came through- parts-of-speech selection for nouns and verbs
> > > helped 5-10% in every combination of regularizers. All across the
> > > board. If you want good tags, select your parts of speech!
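
A rough sketch of the kind of term weighting mentioned above, combining a local cell weight (tf, binary, log, or augmented) with a global idf weight. The formulas are the common textbook definitions and are an assumption here; the thread does not show the actual code, and the gfidf, entropy, normal, and probabilistic-inverse variants as well as the SVD step are left out. All names are hypothetical.

```python
import math
from collections import Counter

# Local (per-cell) weights; max_count is the largest raw count in the document.
LOCAL = {
    "tf": lambda count, max_count: float(count),
    "binary": lambda count, max_count: 1.0 if count > 0 else 0.0,
    "log": lambda count, max_count: math.log(1.0 + count),
    "augmented": lambda count, max_count:
        0.5 + 0.5 * count / max_count if count > 0 else 0.0,
}


def weight_matrix(docs, local="log"):
    """Return one {term: weight} dict per document (a sparse doc/term matrix).

    docs is a list of token lists. Each cell is local_weight * idf, where
    idf = log(N / df) and df is the number of documents containing the term.
    """
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log(n_docs / df[term]) for term in df}
    local_fn = LOCAL[local]
    weighted = []
    for c in counts:
        max_count = max(c.values()) if c else 1
        weighted.append({t: local_fn(n, max_count) * idf[t]
                         for t, n in c.items()})
    return weighted
```

For a sentence/term matrix as used for summarization, each "document" passed in would be one sentence's tokens.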
> > >
> > > On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> > > <dawid.weiss@cs.put.poznan.pl> wrote:
> > >> I know, I know. :) Just wanted to mention that it could lead to funny
> > >> results, that's all. There are lots of ways of doing proper form
> > >> disambiguation, including shallow tagging, which then allows you to
> > >> retrieve correct base forms (lemmas, not stems). Stemming is
> > >> typically good enough (and fast), so your advice was 100% fine.
> > >>
> > >> Dawid
> > >>
> > >> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >>> This is definitely just the first step.  Similar goofs happen with
> > >>> inappropriate stemming.  For instance, AIDS should not stem to aid.
> > >>>
> > >>> A reasonable way to find and classify exceptional cases is to look at
> > >>> cooccurrence statistics.  The contexts of original forms can be
> > >>> examined to find cases where there is a clear semantic mismatch
> > >>> between the original and the set of all forms that stem to the same
> > >>> form.
> > >>>
> > >>> But just picking the most common that is present in the document is a
> > >>> pretty good step for all that it produces some oddities.  The results
> > >>> are much better than showing a user the stemmed forms.
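
A minimal sketch of the unstemming dictionary discussed in this part of the thread: record which surface forms led to each stem, keep frequencies, and map a stem back to the most common form that is present in the document. The `toy_stem` function is a hypothetical stand-in for a real stemmer, and the fallback to the globally most common form is an assumption, not something stated in the thread.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_unstem_dict(corpus_tokens, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    forms = defaultdict(Counter)
    for token in corpus_tokens:
        forms[stem(token)][token.lower()] += 1
    return forms


def unstem(stem_form, forms, doc_tokens=None):
    """Pick a readable form for a stem.

    Prefer the most frequent form that occurs in doc_tokens; otherwise
    fall back to the most frequent form in the whole corpus.
    """
    counts = forms.get(stem_form)
    if not counts:
        return stem_form
    if doc_tokens is not None:
        present = {t.lower() for t in doc_tokens}
        for form, _ in counts.most_common():
            if form in present:
                return form
    return counts.most_common(1)[0][0]


corpus = "clusters cluster clustering clusters clustering clustering".split()
forms = build_unstem_dict(corpus)
```

Even this toy stemmer maps "AIDS" and "aid" to the same stem, which is the kind of conflation that cooccurrence statistics would have to catch.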
> > >>>
> > >>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss <dawid.weiss@cs.put.poznan.pl> wrote:
> > >>>
> > >>>>> Unstemming is pretty simple.  Just build an unstemming dictionary
> > >>>>> based on seeing what word forms have led to a stemmed form.
> > >>>>> Include frequencies.
> > >>>>
> > >>>> This can lead to very funny (or not, depending on how you look at it)
> > >>>> mistakes when different lemmas stem to the same token. How frequent
> > >>>> and important this phenomenon is varies from language to language
> > >>>> (and can be calculated a priori).
> > >>>>
> > >>>> Dawid
> > >>>>
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > goksron@gmail.com
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>
