mahout-user mailing list archives

From Charly Lizarralde <charly.lizarra...@gmail.com>
Subject Re: Converting one large text file with multiple documents to SequenceFile format
Date Wed, 31 Oct 2012 20:30:48 GMT
I don't think you need that. Just a simple mapper.

static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Input: one document per line, formatted as docId<TAB>document text.
    // Output: key = docId, value = document text.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] fields = value.toString().split("\t");
        // Lines without a tab are skipped silently.
        if (fields.length >= 2) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}

and then run a simple map-only job:

        Job text2SequenceFileJob = this.prepareJob(
                this.getInputPath(), this.getOutputPath(),
                TextInputFormat.class, IdentityMapper.class,
                Text.class, Text.class, SequenceFileOutputFormat.class);

        text2SequenceFileJob.setOutputKeyClass(Text.class);
        text2SequenceFileJob.setOutputValueClass(Text.class);
        text2SequenceFileJob.setNumReduceTasks(0);

        text2SequenceFileJob.waitForCompletion(true);
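
A side note on the tab handling, as a sketch rather than part of the job itself: split("\t") followed by fields[1] keeps only the text up to the second tab, so a document that contains tabs would be truncated. Assuming that is not wanted, split("\t", 2) keeps everything after the first tab. The names below (LineSplitSketch, parseLine) are hypothetical, and rejecting lines with an empty id is an assumption. It is plain Java with no Hadoop dependency, so the parsing can be checked on its own.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

public class LineSplitSketch {

    // Splits "docId<TAB>document text" at the FIRST tab only.
    // Returns empty when the line has no tab or the id is empty (assumption).
    static Optional<Map.Entry<String, String>> parseLine(String line) {
        String[] fields = line.split("\t", 2); // limit 2: at most one split
        if (fields.length < 2 || fields[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new SimpleEntry<>(fields[0], fields[1]));
    }

    public static void main(String[] args) {
        String line = "doc1\tfirst column\tsecond column";

        // One-argument split, as in the mapper: fields[1] stops at the second tab.
        String truncated = line.split("\t")[1];
        System.out.println("split(\"\\t\")[1]    : [" + truncated + "]");

        // Two-argument split: the value keeps the rest of the line.
        parseLine(line).ifPresent(e ->
                System.out.println("split(\"\\t\", 2)[1] : [" + e.getValue() + "]"));
    }
}
```

In the mapper, the same two-argument split could replace the one-argument call if documents may contain tabs.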

Cheers!
Charly

On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <levar1@hotmail.com> wrote:

>
> Yeah, I've looked at filter classes, but nothing worked.  I guess I'll do
> something similar and continuously save each line into a file and then run
> seqdirectory.  The running time won't look good, but at least it should
> work.  Thanks for the response.
>
> Nick
>
> > From: charly.lizarralde@gmail.com
> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > To: user@mahout.apache.org
> >
> > I had the exact same issue. I tried to use the seqdirectory command with
> > a different filter class, but it did not work. It seems there's a bug in
> > the mahout-0.6 code.
> >
> > I ended up writing a custom map-reduce program that performs just that.
> >
> > Greetings!
> > Charly
> >
> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <levar1@hotmail.com> wrote:
> >
> > >
> > > I have done a lot of searching on the web for this, but I've found
> > > nothing, even though I feel like it has to be somewhat common. I have
> > > used Mahout's 'seqdirectory' command to convert a folder containing text
> > > files (each file is a separate document) in the past. But in this case
> > > there are so many documents (in the 100,000s) that I have one very large
> > > text file in which each line is a document. How can I convert this large
> > > file to SequenceFile format so that Mahout understands that each line
> > > should be considered a separate document?  Would it be better if the
> > > file was structured like so....
> > >
> > > docId1 {tab} document text
> > > docId2 {tab} document text
> > > docId3 {tab} document text
> > > ...
> > >
> > > Thank you very much for any help.
> > >
> > > Nick
> > >
>
>
