mahout-user mailing list archives

From Vasil Vasilev <vavasi...@gmail.com>
Subject Re: LDA from Lucene Indexes
Date Wed, 11 May 2011 15:10:44 GMT
Hi Chris,

I had a similar problem to what you describe. It turned out that many of the
words I wanted to "stop" are also words with high document frequency.
One way to avoid these words is to use maxDFPercent, but there
are two issues with this:
1. You need to know exactly what percentage to select
2. It works only on the tfidf vectors and not on the tf ones (LDA uses the
latter)

You can take a look at
https://issues.apache.org/jira/browse/MAHOUT-688 which provides one
possible solution.
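
To make the second issue above concrete, here is a minimal sketch of what document-frequency pruning applied directly to tf vectors could look like. It is plain Java over in-memory maps and is not Mahout code or the MAHOUT-688 patch; the class name DfPruneSketch, the method names, and the rule "drop a term when its document frequency exceeds maxDfPercent of the documents" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Mahout.
public class DfPruneSketch {

    /** Counts, for every term, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> tfDocs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : tfDocs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Returns copies of the tf maps without the terms that occur in more than
     * maxDfPercent percent of the documents. The input maps are not modified.
     */
    static List<Map<String, Integer>> pruneByMaxDfPercent(
            List<Map<String, Integer>> tfDocs, int maxDfPercent) {
        Map<String, Integer> df = documentFrequencies(tfDocs);
        int numDocs = tfDocs.size();
        List<Map<String, Integer>> pruned = new ArrayList<>();
        for (Map<String, Integer> doc : tfDocs) {
            Map<String, Integer> kept = new HashMap<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                // Integer form of: df / numDocs <= maxDfPercent / 100
                if (df.get(e.getKey()) * 100 <= maxDfPercent * numDocs) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("the", 5);
        d1.put("lucene", 2);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("the", 3);
        d2.put("mahout", 4);
        Map<String, Integer> d3 = new HashMap<>();
        d3.put("the", 1);
        d3.put("lda", 2);

        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(d1);
        docs.add(d2);
        docs.add(d3);

        // "the" occurs in all three documents, so a 50 percent cap removes it.
        System.out.println(pruneByMaxDfPercent(docs, 50));
    }
}
```

The raw term counts of the surviving terms are left as they are, so the result is still a tf representation of the kind LDA takes as input. Choosing the percentage (the first issue) remains a manual decision in this sketch.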

On Thu, May 5, 2011 at 4:27 PM, Chris McConnell
<c.t.mcconnell.ge@gmail.com> wrote:

> Hi guys,
>
> I'm jumping back as the later emails jump into expansions (all of
> which sound great), but I wanted to give this a better link back to
> the original question.
>
> This adjustment allowed me to get the vectors created, create the lda
> input and grab the topics out of the final results.
>
> I'm curious if anyone has done testing with the parameters at all.
> Obviously different data will lead to different parameter needs
> (number of topics, smoothing, iterations, etc.) but I'm wondering
> particularly about "stop words." I believe I ran across some older
> questions in the mailing list about this, where users were curious if
> they could be specified in Mahout, or if we should be doing so within
> the Lucene index creation, others?
>
> Another thought I had, we have the dictionary output, if we were to
> modify the dictionary to remove those stop words, would that have a
> similar effect, or does the algorithm (haven't had a chance to dig
> into it yet, so I apologize if this is obvious) require every word
> within the vector to exist in the dictionary?
>
> Thanks for all the help, I'm excited this chain has gathered some
> steam within the community to improve the algorithm(s) surrounding
> LDA, as we (GE) feel this library has great potential.
>
> Best,
> Chris
>
> bin/mahout lda -i /user/TopicTrending/ -o /user/TopicTrending/lda_output/ -k 5 -v 50000
>
> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
> > Hi Chris,
> >
> >  That's what I thought.  This line needs to make sure you store
> > term vectors (see this article for more details:
> > http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/
> > ):
> >
> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
> > <c.t.mcconnell.ge@gmail.com> wrote:
> >>
> >> if (elementName.equals("doc")) {
> >>                if(title && content){
> >>                                doc.add(new Field("title",titleStr,Field.Store.YES,Field.Index.ANALYZED));
> >>                                doc.add(new Field("content",contentStr,Field.Store.YES,Field.Index.ANALYZED));
> >
> >
> > You want this to be:
> >
> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
> > Field.TermVector.YES);
> >
> > Although technically, we could add the capability to take a Store.YES
> > field and re-tokenize and build vectors from this as well.
> >
> >  -jake
> >
>
