mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: Questions on compressed input, custom tokenizers, and feature selection
Date Sun, 17 Nov 2013 06:47:41 GMT
Brian,

You can create a custom MR job for converting GZIP files to SequenceFiles (you don't have to use Mahout's
'seqdirectory' for this).
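For what it's worth, the conversion step itself is small: each gzipped document becomes one (file name, text) record, which a mapper would then write out through a SequenceFile output format. Note also that Hadoop's TextInputFormat transparently decompresses .gz inputs through its compression-codec support (though gzip files aren't splittable, so each file goes to a single mapper). A stdlib-only sketch of the per-file decompression step — class and method names here are mine, not Mahout's:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipToRecord {

    // Decompress one gzipped document into the String that would become
    // the SequenceFile record value (key = file name, value = text).
    public static String decompress(byte[] gzippedBytes) throws IOException {
        GZIPInputStream in =
            new GZIPInputStream(new ByteArrayInputStream(gzippedBytes));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Helper to produce gzipped test input, standing in for a .gz file on HDFS.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bytes);
        gz.write(text.getBytes(StandardCharsets.UTF_8));
        gz.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("From: someone\nA 20news-style post body.");
        System.out.println(decompress(compressed));
    }
}
```

In a real job the mapper would emit a (Text fileName, Text contents) pair per file and the job would use a SequenceFile output format, mirroring what seqdirectory produces for plain text.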




On Saturday, November 16, 2013 1:32 PM, Brian Rogoff <brogoff@gmail.com> wrote:
 
Hi Suneel,
    Thanks for the quick reply!

    For problem 1, how would I do that? Would a workable solution be to create a new utility,
called say 'seqcmprdir', from a copy of SequenceFilesFromDirectory.java specialized to work
on gzipped files? Or is there a better way? I'm a newcomer to Hadoop internals (though not
to Java), so sorry if this sounds trivial.

   I'll look into specializing seq2sparse. I'll clarify my problem and formulate the question
better, or just solve it.


    I'd love to upgrade to Mahout 0.8 or even 0.9-SNAPSHOT, but Mahout 0.8 is compiled against
a later version of Guava than our current install of Hadoop, so 0.7 is the best
we can do for now. When ops upgrades our cluster to a newer CDH, maybe we can switch then.

   Thanks again!

-- Brian




On Fri, Nov 15, 2013 at 5:51 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

Hi Brian,
>
>1. seqdirectory presently only works with Text files. You would have to create your own
utility for generating sequence files from gzip.
>
>    It should be easy to create an MR job that reads gzip files and creates Sequence
files.
>
>2. Custom Tokenizers:
>
>     Could you provide more specifics here?
>
>    If you are creating a Custom Lucene Tokenizer, then you should be able to plug
that into the call to seq2sparse (which is subsequent to seqdirectory in Mahout's processing
pipeline).
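As a concrete illustration of the tokenizer route: seq2sparse accepts a Lucene Analyzer class name (the --analyzerName option in 0.7), so custom tokenization lives inside the analyzer's token stream. Stripped of the Lucene plumbing, the core splitting logic that such a Tokenizer's incrementToken() loop would implement can be sketched in plain Java — the class name and the lowercase-alphanumeric rule below are examples, not anything from Mahout:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

    // Lowercase alphanumeric tokenizer: splits on any character that is
    // not a letter or digit, lowercasing as it goes. This is the kind of
    // rule a custom Lucene Tokenizer would apply token by token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World-42!"));
    }
}
```

Once wrapped in a Lucene Analyzer, the class just needs to be on the classpath when seq2sparse runs.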
>
>On Friday, November 15, 2013 7:05 PM, Brian Rogoff <brogoff@gmail.com> wrote:
>
>Hi,
>    I'm using Mahout 0.7 with Hadoop 0.20.2-cdh3u2, evaluating it for use
>within our company. I have a few questions
>
>    I'd like to use Mahout classification on some data that I have which is
>stored as gzipped files. I'd like to create the sequence data directly from
>those compressed files. Is there some file filter class I can use which
>will enable me to transparently work from the compressed data?
>
>    In case that isn't clear, consider the 20news example in the
>mahout-distribution-0.7. If I create a parallel directory to 20news-all
>where all of the leaf files are gzipped, say gzipped-news-all, I'd like to
>run
>
>./bin/mahout seqdirectory -i ${WORK_DIR}/gzipped-news-all -o
>${WORK_DIR}/gzipped-news-seq
>
>perhaps with another argument to indicate that the input data is
>compressed, and have gzipped-news-seq be identical to the 20news-seq dir
>resulting from running
>
>./bin/mahout seqdirectory -i ${WORK_DIR}/20news-all -o
>${WORK_DIR}/20news-seq
>
>    I'd like to see how to substitute custom tokenizers into this flow, if
>someone could point me to an example, and I'd also like to know if there
>are examples of tweaking the feature selection algorithms.
>
>    Thanks in advance!
>
>-- Brian