mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: LDA from Lucene Indexes
Date Thu, 05 May 2011 12:54:46 GMT

On May 4, 2011, at 2:31 PM, Jake Mannix wrote:

> On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
>> Pipelining is good for abstraction and really bad for performance (in the
>> map-reduce world).
>> 
>> My thought is that we could have a multipurpose tool.  Input would be a
>> lucene index and the program would read term vectors or original text as
>> available.  Output would be either sequence file full of text or sequence
>> file full of vectors.
>> 
> 
> Ok, sure, then this is modifying the lucene.vectors code, not the
> seq2sparse code, right?

Easiest is to dump to text and then use seq2sparse, which has all of the functionality for
tokenizing, etc. As Jake said, it's about 5 lines of code plus boilerplate. I think I even
have some lying around somewhere.

If we go the route Ted suggests here, we should probably refactor both lucene.vec and
seq2sparse to share a single piece for doing the analysis. After all, it's entirely feasible
that one would want to postprocess what comes out of the term vector too (for instance, if it
wasn't stemmed before, or if you wanted more aggressive stopword removal).
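
To make the "shared piece" idea concrete, here is a rough sketch of what it might look like. Everything in it is an assumption for illustration: SharedAnalysis, fromText and fromTermVector are hypothetical names, not existing Mahout or Lucene classes, and the "stemmer" is a toy that only strips a trailing "s". A real version would presumably delegate to a Lucene Analyzer instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical shared analysis step; not an existing Mahout class. */
final class SharedAnalysis {
  private final Set<String> stopwords;

  SharedAnalysis(Set<String> stopwords) {
    this.stopwords = stopwords;
  }

  /** Normalizes one term, or returns null if the term should be dropped. */
  String normalize(String term) {
    String t = term.toLowerCase(Locale.ROOT);
    if (t.isEmpty() || stopwords.contains(t)) {
      return null;
    }
    // Toy stand-in for a stemmer (assumption): strip a trailing "s".
    if (t.length() > 3 && t.endsWith("s")) {
      t = t.substring(0, t.length() - 1);
    }
    return t;
  }

  /** Raw-text path: tokenize the original text, then normalize each token. */
  Map<String, Integer> fromText(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
      add(counts, token, 1);
    }
    return counts;
  }

  /** Term-vector path: postprocess term -> frequency pairs read from the index. */
  Map<String, Integer> fromTermVector(Map<String, Integer> termFreqs) {
    Map<String, Integer> counts = new HashMap<>();
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      add(counts, e.getKey(), e.getValue());
    }
    return counts;
  }

  // Both paths funnel through the same normalization, so terms that
  // collapse to the same form have their frequencies summed.
  private void add(Map<String, Integer> counts, String term, int freq) {
    String t = normalize(term);
    if (t != null) {
      counts.merge(t, freq, Integer::sum);
    }
  }
}
```

The point is only that the text path and the term-vector path end in the same normalize step, so stemming or stopword changes would be made in one place.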

-Grant

