mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: LDA from Lucene Indexes
Date Wed, 04 May 2011 17:08:44 GMT
On Wed, May 4, 2011 at 8:53 AM, Julian Limon <julian.limon@tukipa.com> wrote:

> This sounds really interesting. Is there a way to dump certain fields from
> a Lucene index to text files?
>
> If so, I could use Lucene to do the parsing, and then seqdirectory and
> seq2sparse to generate Mahout vectors out of these files.
>

For this to work, the fields need to be indexed with either Store.YES or
TermVector.YES.  If you have the latter, you don't need text files at all:
you can use the usual lucene.vector script to produce Mahout vectors.

We don't currently have a script to dump stored fields, but it should take
only another 5 lines of code to write one (ok, 25 lines, including
boilerplate, damn Java).  File a ticket; there are lots of people around
here who could write that code.
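
As a rough sketch of the output half of such a dump tool (not an existing
Mahout script): the class name StoredFieldDumper and its dump method are
hypothetical, and the sketch assumes the stored field values have already
been read into a map of document id to text. With Lucene 3.x that read
would presumably be a loop over IndexReader.maxDoc() that skips deleted
documents and calls reader.document(i).get(fieldName); that part is left
out here so the example depends only on the JDK. It writes one UTF-8 file
per document, which is the kind of directory seqdirectory takes as input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical helper: writes one UTF-8 text file per document. */
final class StoredFieldDumper {

    /**
     * Writes each (id, text) pair to outDir/<id>.txt and returns the
     * number of files written. Entries whose text is null or empty
     * (for example, a field that was not stored) are skipped.
     */
    static int dump(Map<String, String> docs, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            String text = entry.getValue();
            if (text == null || text.isEmpty()) {
                continue;
            }
            // Replace characters that are unsafe in file names with '_'.
            String safeId = entry.getKey().replaceAll("[^A-Za-z0-9._-]", "_");
            Files.write(outDir.resolve(safeId + ".txt"),
                        text.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }
}
```

The resulting directory could then be passed to seqdirectory and seq2sparse
as Julian describes. Note that two ids that differ only in replaced
characters would map to the same file name; a real tool would need to
handle that.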

  -jake


> Thanks,
>
> Julian
>
> 2011/5/3 Jake Mannix <jake.mannix@gmail.com>
>
> > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gsingers@apache.org>
> > wrote:
> >
> > >
> > > > Although technically, we could add the capability to take a Store.YES
> > > > field and re-tokenize and build vectors from this as well.
> > >
> > > True, or we could just dump stored fields out to text and use the
> > > existing text converter
> >
> >
> > That would probably be the right way to do that, actually.
> >
> >  -jake
> >
>
