mahout-user mailing list archives

From Nick Woodward <lev...@hotmail.com>
Subject RE: Converting one large text file with multiple documents to SequenceFile format
Date Mon, 12 Nov 2012 00:18:00 GMT

Diego,

Thank you so much for the script. I used it to convert my large text file to a sequence
file. I have been trying to feed that sequence file to Mahout's LDA implementation (Mahout
0.7, so the CVB implementation). I first converted the sequence file to vectors with

    mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7

and then ran the LDA with

    mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10

The seq2sparse command produces the tf vectors fine, but no matter what parameters I use,
the LDA job sits at map 0% reduce 0% for an hour and then fails with the error below: a
ClassCastException from Text to IntWritable. My question is: when you said that the key is
the line number, what type is the key? Is it Text?
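From what I can tell, cvb expects IntWritable document ids, while seq2sparse passes through whatever keys the input sequence file had. So I wonder whether I need to re-key the vectors with Mahout's rowid job before running cvb, something like the sketch below (untested; paths are the ones from my commands above, and the exact flags may differ by version -- see `mahout rowid --help`):

```shell
# Re-key the tf vectors with sequential IntWritable row ids (sketch, untested).
mahout rowid -i output/tf-vectors -o output/rowid

# rowid should write output/rowid/matrix (IntWritable -> VectorWritable)
# plus output/rowid/docIndex mapping row ids back to the original keys.
# Then point cvb at the matrix instead of the raw tf vectors:
mahout cvb -i output/rowid/matrix -dict output/dictionary.file-0 \
  -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
```

Does that sound right?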

My output:

12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Thank you again for your help!

Nick


> From: diego.ceccarelli@gmail.com
> Date: Thu, 1 Nov 2012 01:07:29 +0100
> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> To: user@mahout.apache.org
> 
> Hi Nick,
> I had exactly the same problem ;)
> I wrote a simple command line utility to create a sequence
> file where each line of the input document is an entry
> (the key is the line number).
> 
> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> 
> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI
> -input tweets -output tweets.seq
> 
> enjoy ;)
> Diego
> 
> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde
> <charly.lizarralde@gmail.com> wrote:
> > I don't think you need that. Just a simple mapper:
> >
> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> >
> >     @Override
> >     protected void map(LongWritable key, Text value, Context context)
> >             throws IOException, InterruptedException {
> >         // Split each input line into docId and document text on the tab.
> >         String[] fields = value.toString().split("\t");
> >         if (fields.length >= 2) {
> >             context.write(new Text(fields[0]), new Text(fields[1]));
> >         }
> >     }
> > }
> >
> > and then run a simple job:
> >
> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> >         this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> >         Text.class, Text.class, SequenceFileOutputFormat.class);
> >
> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > text2SequenceFileJob.setNumReduceTasks(0);
> >
> > text2SequenceFileJob.waitForCompletion(true);
> >
> > Cheers!
> > Charly
> >
> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <levar1@hotmail.com> wrote:
> >
> >>
> >> Yeah, I've looked at filter classes, but nothing worked.  I guess I'll do
> >> something similar and continuously save each line into a file and then run
> >> seqdirectory.  The running time won't look good, but at least it should
> >> work.  Thanks for the response.
> >>
> >> Nick
> >>
> >> > From: charly.lizarralde@gmail.com
> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> >> > Subject: Re: Converting one large text file with multiple documents to
> >> SequenceFile format
> >> > To: user@mahout.apache.org
> >> >
> >> > I had the exact same issue. I tried to use the seqdirectory command
> >> > with a different filter class, but it did not work; there seems to be
> >> > a bug in the mahout-0.6 code.
> >> >
> >> > I ended up writing a custom map-reduce program that does just that.
> >> >
> >> > Greetings!
> >> > Charly
> >> >
> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <levar1@hotmail.com>
> >> wrote:
> >> >
> >> > >
> >> > > I have done a lot of searching on the web for this, but I've found
> >> > > nothing, even though I feel like it has to be somewhat common. I have
> >> > > used Mahout's 'seqdirectory' command in the past to convert a folder
> >> > > containing text files (each file is a separate document). But in this
> >> > > case there are so many documents (in the 100,000s) that I have one
> >> > > very large text file in which each line is a document. How can I
> >> > > convert this large file to SequenceFile format so that Mahout
> >> > > understands that each line should be considered a separate document?
> >> > > Would it be better if the file was structured like so:
> >> > >
> >> > > docId1 {tab} document text
> >> > > docId2 {tab} document text
> >> > > docId3 {tab} document text
> >> > > ...
> >> > >
> >> > > Thank you very much for any help.
> >> > >
> >> > > Nick
> >> > >
> >>
> >>
> 
> 
> 
> -- 
> Computers are useless. They can only give you answers.
> (Pablo Picasso)
> _______________
> Diego Ceccarelli
> High Performance Computing Laboratory
> Information Science and Technologies Institute (ISTI)
> Italian National Research Council (CNR)
> Via Moruzzi, 1
> 56124 - Pisa - Italy
> 
> Phone: +39 050 315 3055
> Fax: +39 050 315 2040
> ________________________________________
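[For readers of the archive: the per-file fallback mentioned in the quoted thread -- save each line of the big file as its own document, then run seqdirectory over the folder -- can be sketched in shell. The file name and doc ids below are made up for illustration:]

```shell
# Split a docId<TAB>text file into one file per document.
# "big.txt" and the doc ids are hypothetical sample data.
printf 'doc1\thello world\ndoc2\tsecond document\n' > big.txt
mkdir -p docs
awk -F'\t' 'NF >= 2 { print $2 > ("docs/" $1) }' big.txt
# Each document now lives in docs/<docId>; then, as in the thread:
# mahout seqdirectory -i docs -o docs-seq
```

With hundreds of thousands of lines this creates as many small files, which is exactly the overhead the SequenceFile approaches above avoid.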