mahout-user mailing list archives

From Julian Limon <julian.li...@tukipa.com>
Subject Re: LDA from Lucene Indexes
Date Wed, 04 May 2011 17:22:54 GMT
Thanks, Jake!

I also need certain files that are generated in the seq2sparse process (tf),
so lucene.vector might not be the best choice. I'll take a look at dumping
stored fields, then.

Thanks
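
For reference, a minimal sketch of the "dump stored fields to text" step discussed in this thread, written under stated assumptions. It is not an existing Mahout or Lucene tool. The class name StoredFieldDumper and its methods are hypothetical. It assumes the stored values have already been read out of the index into a map of document id to text (in Lucene 3.x this would presumably be a loop over IndexReader.document(i).get(fieldName), skipping deleted documents), and it only shows writing one UTF-8 text file per document into a directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: writes already-extracted stored field values
// to one text file per document. Reading the values from the Lucene
// index is assumed to happen elsewhere and is not shown here.
final class StoredFieldDumper {

    // Replace characters that are unsafe in file names with '_'.
    // Note: two different ids can map to the same file name.
    static String safeName(String docId) {
        return docId.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    // Writes each non-empty value to <outDir>/<safe id>.txt as UTF-8
    // and returns the number of files written.
    static int dump(Map<String, String> storedValues, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        for (Map.Entry<String, String> e : storedValues.entrySet()) {
            String text = e.getValue();
            if (text == null || text.isEmpty()) {
                continue; // field not stored or empty for this document
            }
            Path file = outDir.resolve(safeName(e.getKey()) + ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; real values would come from the index.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc-1", "first stored body");
        docs.put("doc-2", "second stored body");
        Path out = Files.createTempDirectory("lucene-dump");
        System.out.println(dump(docs, out) + " files written to " + out);
    }
}
```

The resulting directory could then be passed to seqdirectory (-i for the text directory, -o for the sequence file output) and that output to seq2sparse, which is the pipeline described in the quoted messages below.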

2011/5/4 Jake Mannix <jake.mannix@gmail.com>

> On Wed, May 4, 2011 at 8:53 AM, Julian Limon <julian.limon@tukipa.com>
> wrote:
>
> > This sounds really interesting. Is there a way to dump certain fields
> > from a Lucene index to text files?
> >
> > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > seq2sparse to generate Mahout vectors out of these files.
> >
>
> You need the fields to be either Store.YES or TermVector.YES for this
> to work.  If you have the latter, then you don't need them in text files;
> you can use the usual lucene.vector script to produce Mahout vectors.
>
> To dump stored fields, we don't currently have a script to do that, but it
> should be another 5 lines of code to write one (ok, 25 lines, including
> boilerplate, damn java).  File a ticket, there are lots of people around
> here who could write that code.
>
>  -jake
>
>
> > Thanks,
> >
> > Julian
> >
> > 2011/5/3 Jake Mannix <jake.mannix@gmail.com>
> >
> > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gsingers@apache.org>
> > > wrote:
> > >
> > > >
> > > > > Although technically, we could add the capability to take a
> > > > > Store.YES field and re-tokenize and build vectors from this as well.
> > > >
> > > > True, or we could just dump stored fields out to text and use the
> > > > existing text converter
> > >
> > >
> > > That would probably be the right way to do that, actually.
> > >
> > >  -jake
> > >
> >
>
