mahout-user mailing list archives

From Chris McConnell
Subject Re: LDA from Lucene Indexes
Date Thu, 05 May 2011 13:27:35 GMT
Hi guys,

I'm jumping back in the thread because the later emails move on to
possible extensions (all of which sound great), and I wanted to tie
this reply back to the original question.

The adjustment Jake suggested (quoted below) allowed me to create the
vectors, build the LDA input, and pull the topics out of the final
results.

I'm curious whether anyone has tested the parameters at all.
Obviously different data will call for different settings (number of
topics, smoothing, iterations, etc.), but I'm wondering particularly
about "stop words." I believe I ran across some older questions on
the mailing list about this, where users asked whether stop words can
be specified in Mahout, or whether we should handle them during
Lucene index creation, or somewhere else.
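To make the "handle them before indexing" option concrete, here is a minimal sketch of dropping stop words at tokenization time, before any text reaches the index. It uses only the Java standard library; the class name StopWordFilter, the stop list, and the splitting rule are all assumptions made for illustration and are not part of Mahout or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not a Mahout or Lucene class.
final class StopWordFilter {
    // Illustrative stop list (an assumption); a real one would be tuned to the corpus.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Lowercases the text, splits on runs of characters that are neither
    // letters nor digits, and drops empty tokens and stop words.
    static List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The topics of the index"));
    }
}
```

Filtering this early means the stop words never appear in the term vectors or the dictionary, so nothing downstream has to know about them.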

Another thought: we have the dictionary output. If we modified the
dictionary to remove those stop words, would that have a similar
effect? Or does the algorithm require every word in a vector to
exist in the dictionary? (I haven't had a chance to dig into it yet,
so I apologize if this is obvious.)
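The dictionary question can be made concrete with a small sketch. It assumes a dictionary that maps each term to an integer index and sparse vectors that map term index to weight; those shapes, and the class name DictionaryPruner, are assumptions for illustration and not Mahout's actual types. Under that assumption, deleting a term from the dictionary alone leaves its index in every vector, so the sketch prunes both together and renumbers the surviving terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not a Mahout class.
final class DictionaryPruner {
    // Returns old term index -> new term index for every term that is kept.
    // Surviving terms are renumbered from 0 in the dictionary map's iteration order.
    static Map<Integer, Integer> remap(Map<String, Integer> dictionary, Set<String> stopWords) {
        Map<Integer, Integer> oldToNew = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            if (!stopWords.contains(entry.getKey())) {
                oldToNew.put(entry.getValue(), oldToNew.size());
            }
        }
        return oldToNew;
    }

    // Rewrites one sparse vector (term index -> weight), dropping removed terms
    // and translating the remaining indices to the new numbering.
    static Map<Integer, Double> prune(Map<Integer, Double> vector, Map<Integer, Integer> oldToNew) {
        Map<Integer, Double> pruned = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            Integer newIndex = oldToNew.get(entry.getKey());
            if (newIndex != null) {
                pruned.put(newIndex, entry.getValue());
            }
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("the", 0);
        dictionary.put("topic", 1);
        dictionary.put("of", 2);
        dictionary.put("model", 3);
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));

        Map<Integer, Double> vector = new LinkedHashMap<>();
        vector.put(0, 2.0);
        vector.put(1, 1.0);
        vector.put(3, 4.0);

        System.out.println(prune(vector, remap(dictionary, stopWords)));
    }
}
```

If the vocabulary shrinks this way, any vocabulary-size argument passed to the LDA job (the -v value in the command below) would presumably need to shrink with it.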

Thanks for all the help. I'm excited that this thread has gathered
some steam within the community to improve the algorithm(s)
surrounding LDA, as we (GE) feel this library has great potential.


bin/mahout lda -i /user/TopicTrending/ -o /user/TopicTrending/lda_output/ -k 5 -v 50000

On Tue, May 3, 2011 at 12:22 PM, Jake Mannix wrote:
> Hi Chris,
> That's what I thought. This line needs to make sure you store term vectors
> (see this article for more details):
> On Tue, May 3, 2011 at 8:32 AM, Chris McConnell wrote:
>> if (elementName.equals("doc")) {
>>     if (title && content) {
>>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
> You want this to be:
> new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES);
> Although technically, we could add the capability to take a Store.YES field,
> re-tokenize it, and build vectors from that as well.
>  -jake
