mahout-user mailing list archives

From Suneel Marthi <>
Subject Re: Dictionary file format in Lucene-Mahout integration
Date Thu, 06 Jun 2013 23:04:45 GMT

Thinking out loud here: in light of the fix for MAHOUT-944 (the lucene2seq utility), which has
been committed to trunk, do we still need to maintain lucene.vector?

The path then would be lucene2seq -> seq2sparse -> rowid -> cvb.

 From: Grant Ingersoll <>
To:; James Forth <> 
Sent: Wednesday, June 5, 2013 10:46 AM
Subject: Re: Dictionary file format in Lucene-Mahout integration

    File dictOutFile = new File(dictOut);
    log.info("Dictionary Output file: {}", dictOutFile);
    Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
    DelimitedTermInfoWriter tiWriter = new DelimitedTermInfoWriter(writer, delimiter, field);
    try {
    } finally {

This code in the Lucene Driver class is the culprit.  The way to fix it would be to abstract the
writer and allow it to use other implementations, namely one that supports the seq2sparse
format.

Any chance you are up for patching it, James?
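
A rough sketch of what "abstracting the writer" could look like, for illustration only. All
names below (DictionaryWriter, DelimitedDictionaryWriter, KeyValueDictionaryWriter,
DictionaryExport) are hypothetical and are not Mahout classes. The key/value variant only
collects entries in memory; a real patch would presumably write them to a Hadoop SequenceFile
instead, and the term-to-index layout is an assumption about what cvb expects.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical abstraction: anything that can persist (term, index) dictionary entries. */
interface DictionaryWriter extends Closeable {
  void write(String term, int index) throws IOException;
}

/** Plain-text variant: one "term<delimiter>index" line per entry. */
class DelimitedDictionaryWriter implements DictionaryWriter {
  private final Writer out;
  private final String delimiter;

  DelimitedDictionaryWriter(Writer out, String delimiter) {
    this.out = out;
    this.delimiter = delimiter;
  }

  @Override
  public void write(String term, int index) throws IOException {
    out.write(term + delimiter + index + "\n");
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}

/** Stand-in for a sequence-file variant: keeps key/value pairs in memory (assumption). */
class KeyValueDictionaryWriter implements DictionaryWriter {
  final Map<String, Integer> entries = new LinkedHashMap<>();

  @Override
  public void write(String term, int index) {
    entries.put(term, index);
  }

  @Override
  public void close() {
    // Nothing to release for the in-memory stand-in.
  }
}

class DictionaryExport {
  /** The caller depends only on the interface, so the output format is pluggable. */
  static void export(Iterable<String> terms, DictionaryWriter writer) throws IOException {
    try (DictionaryWriter w = writer) {
      int index = 0;
      for (String term : terms) {
        w.write(term, index++);
      }
    }
  }
}
```

With something along these lines, the driver would pick the implementation from a command-line
option and the rest of the code would stay unchanged.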


On Jun 5, 2013, at 2:00 AM, James Forth <> wrote:

> Hello,
> I’m wondering if anyone can help with a question about the dictionary format in
> lucene.vector-cvb integration.  I’ve previously used the pathway from text
> files:  seqdirectory > seq2sparse > rowid > cvb  and it works fine.  The
> dictionary created by seq2sparse is in sequence file format, and this is accepted by
> cvb.  But when using a pathway from a Lucene index:  lucene.vector > cvb  there is a problem
> with cvb throwing the error “dict.out not a SequenceFile”.
> Lucene.vector appears to generate a dictionary in plain text format, but cvb
> requires it in sequence file format.
> Does anyone know how to use lucene.vector with cvb, which I assume means
> obtaining a dictionary as a sequence file from lucene.vector?
> Thanks for your help.
> James

Grant Ingersoll | @gsingers