mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: LDA from Lucene Indexes
Date Wed, 04 May 2011 17:33:32 GMT
It might be that the right thing is to just tweak the current seq2sparse
process.

Jake,

is that what you were thinking?

On Wed, May 4, 2011 at 10:22 AM, Julian Limon <julian.limon@tukipa.com> wrote:

> Thanks, Jake!
>
> I also need certain files that are generated in the seq2sparse process
> (tf), so lucene.vector might not be the best choice. I'll take a look at
> dumping stored fields, then.
>
> Thanks
>
> 2011/5/4 Jake Mannix <jake.mannix@gmail.com>
>
> > On Wed, May 4, 2011 at 8:53 AM, Julian Limon <julian.limon@tukipa.com> wrote:
> >
> > > This sounds really interesting. Is there a way to dump certain fields
> > > from a Lucene index to text files?
> > >
> > > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > > seq2sparse to generate Mahout vectors out of these files.
> > >
> >
> > You need to either have the fields Store.YES, or TermVector.YES for this
> > to work.  If you have the latter, then you don't need them in text files,
> > you can use the usual lucene.vector script to produce mahout vectors.
> >
> > To dump stored fields, we don't currently have a script to do that, but
> > it should be another 5 lines of code to write one (ok, 25 lines, including
> > boilerplate, damn java).  File a ticket, there are lots of people around
> > here who could write that code.
> >
> >  -jake
> >
> >
> > > Thanks,
> > >
> > > Julian
> > >
> > > 2011/5/3 Jake Mannix <jake.mannix@gmail.com>
> > >
> > > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> > > >
> > > > >
> > > > > > Although technically, we could add the capability to take a
> > > > > > Store.YES field and re-tokenize and build vectors from this
> > > > > > as well.
> > > > >
> > > > > True, or we could just dump stored fields out to text and use the
> > > > > existing text converter
> > > >
> > > >
> > > > That would probably be the right way to do that, actually.
> > > >
> > > >  -jake
> > > >
> > >
> >
>
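
The stored-field dump discussed in the thread could be sketched roughly as
below. This is only an illustration, not code from Mahout: the class name
StoredFieldDumper, the doc-NNNNNN.txt file naming, and the choice of one file
per document are all assumptions. To stay self-contained it takes the field
values as a plain Iterable<String>; with a Lucene 3.x index they would
presumably be read with IndexReader.document(i).get(fieldName) for every
document that is not deleted, which only returns text for Store.YES fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Mahout or Lucene): writes one text file
// per document so that seqdirectory can pick the files up afterwards.
public class StoredFieldDumper {

    /**
     * Writes each non-null field value to its own UTF-8 text file in outDir
     * and returns the number of files written. A null value stands for a
     * document that has no stored value for the field; it is skipped, but
     * still consumes a document number so file names track document order.
     */
    public static int dump(Iterable<String> fieldValues, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int docId = 0;
        int written = 0;
        for (String value : fieldValues) {
            if (value != null) {
                // Assumed naming scheme: doc-000000.txt, doc-000001.txt, ...
                Path file = outDir.resolve(String.format(Locale.ROOT, "doc-%06d.txt", docId));
                Files.write(file, value.getBytes(StandardCharsets.UTF_8));
                written++;
            }
            docId++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in practice these values would come from the index.
        List<String> values = Arrays.asList("first document body", null, "third document body");
        Path outDir = Files.createTempDirectory("stored-fields");
        int written = dump(values, outDir);
        System.out.println("wrote " + written + " files to " + outDir);
    }
}
```

The resulting directory is what seqdirectory would then be pointed at, with
seq2sparse run on its output as Julian describes.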
