mahout-user mailing list archives

From Julian Ortega
Subject Re: Pointer to Reference Docs
Date Mon, 17 Sep 2012 09:32:17 GMT
The *seqdirectory* command takes every file in the specified directory and
makes a Hadoop Sequence File out of it. Sequence Files have a key and a
value, so when you turn a list of files into Sequence Files, the file name
will be the key and the file contents will be the value. However, this is
quite impractical if your corpus is large, as disk reading and writing can
become painfully slow. You might want to have a look at this discussion,
which covers how to use the Sequence File API to transform a key-value
CSV file into sequence files.
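
To make the key/value idea concrete, here is a minimal Python sketch of the
mapping described above: one (file name, file contents) pair per file in a
directory. It only illustrates the concept and does not write the Hadoop
Sequence File format or reproduce Mahout's actual implementation; the
function name directory_to_pairs and the use of the relative path as the
key are assumptions made for this example.

```python
import os


def directory_to_pairs(root):
    """Return sorted (key, value) pairs for every file under root.

    Assumption for this sketch: the key is the file path relative to root
    and the value is the file's text decoded as UTF-8.
    """
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Use forward slashes so keys look the same on every platform.
            key = os.path.relpath(full_path, root).replace(os.sep, "/")
            with open(full_path, encoding="utf-8") as handle:
                pairs.append((key, handle.read()))
    # os.walk gives no ordering guarantee, so sort for a stable result.
    return sorted(pairs)
```

Calling directory_to_pairs("my-corpus") on a hypothetical directory returns
a list you can iterate over as key/value records. Note that it opens every
file one by one, which is the slow part for a large corpus.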

The *seq2sparse* Mahout shell command converts the text documents in
Sequence File format to vectors using either TF or TF-IDF weighting, with
n-gram support.
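
As a rough illustration of what TF and TF-IDF weighting with n-grams means,
here is a small Python sketch. It uses the textbook definitions (raw term
count for TF, log(N / df) for IDF) and plain whitespace tokenization; these
are assumptions for the example, and Mahout's own analyzer and weighting
formula may differ. The function names are made up for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of size 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tf_vectors(docs, max_n=1):
    """Map each document key to a Counter of raw n-gram counts (TF)."""
    return {
        key: Counter(ngrams(text.lower().split(), max_n))
        for key, text in docs.items()
    }


def tfidf_vectors(docs, max_n=1):
    """Weight each raw count by log(N / df), the textbook IDF."""
    tf = tf_vectors(docs, max_n)
    n_docs = len(tf)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for counts in tf.values() for term in counts)
    return {
        key: {
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        for key, counts in tf.items()
    }
```

With docs = {"a": "big data big", "b": "big cats"}, the term "big" appears
in both documents and so gets a TF-IDF weight of zero, while "data" appears
only in "a" and keeps a positive weight. That down-weighting of terms found
everywhere is the point of TF-IDF over plain TF.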

For now I suggest a quick look at this, but I would strongly recommend
reading Mahout in Action, specifically chapter 8.

Hope this helps

On Mon, Sep 17, 2012 at 11:18 AM, David Scarlatti wrote:

> Hi, I'd appreciate any hint on the best source of reference information...
> I've found different examples and quick guides, but if I want to know, e.g.,
> what seqdirectory or seq2sparse exactly does and what the different
> command line options are, with a detailed description, I can't find the place...
> Is this something still to do in Mahout? Should I look at the source code
> to know this?
> Thanks in advance.
> --
> -----
> David.
