mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
Date Thu, 10 Jun 2010 17:28:10 GMT
Hi Jake,

Thanks very much for the help.  I looked into the problem a little deeper
and found that org.apache.mahout.utils.vectors.lucene.Driver was writing
out LongWritable keys instead of IntWritable keys, so I just changed the code
in there.  Should this code be using IntWritables or LongWritables?

I managed to get the similarity matrix to be written to disk but I'm not at
all sure about the results.

My original input was 3 solr documents:

id1: A A B C
id2: B D D
id3: A B B E
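(As a sanity check on the similarity output further down, the pairwise dot products of these three documents can be computed by hand. This is a plain-Java sketch assuming raw term counts over the vocabulary {A, B, C, D, E}; the real vectors are TF-IDF-weighted, so the actual numbers will differ, but the structure of the A * A^T product is the same. `GramCheck` is an illustrative name, not a Mahout class.)

```java
import java.util.Arrays;

public class GramCheck {
    // gram[i][j] = dot product of document i and document j
    static double[][] gram(double[][] docs) {
        int n = docs.length;
        double[][] g = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int t = 0; t < docs[i].length; t++)
                    g[i][j] += docs[i][t] * docs[j][t];
        return g;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {2, 1, 1, 0, 0},  // id1: A A B C
            {0, 1, 0, 2, 0},  // id2: B D D
            {1, 2, 0, 0, 1},  // id3: A B B E
        };
        // Rows come out as [6, 1, 4], [1, 5, 2], [4, 2, 6] for raw counts.
        for (double[] row : gram(docs))
            System.out.println(Arrays.toString(row));
    }
}
```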

After writing these to a sequence file and running your matrix transposition
and multiplication, I get an output called part-00000.  If I read it using
$ mahout seqdumper --seqFile part-00000 then it outputs:

Input Path: part-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: org.apache.mahout.math.VectorWritable@288051
Key: 1: Value: org.apache.mahout.math.VectorWritable@288051
Key: 2: Value: org.apache.mahout.math.VectorWritable@288051
Count: 3

Is this what is to be expected?
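(One likely explanation for the repeated `VectorWritable@288051` values: if the value class does not override `toString()`, each record prints Java's default `ClassName@hexHash` identity string, and since sequence-file readers typically reuse a single value instance, every record shows the same one. The vector contents may be fine; they just aren't rendered. A minimal plain-Java sketch of the effect, with `FakeVectorWritable` as a made-up stand-in and no Hadoop classes involved:)

```java
public class ToStringDemo {
    // Stand-in for a Writable value class that does NOT override toString().
    static final class FakeVectorWritable {
        double[] values;
        void set(double[] v) { this.values = v; }
        // no toString() override -> Object's default "ClassName@hexHash"
    }

    public static void main(String[] args) {
        // One instance, reused for every record, as a sequence-file reader does.
        FakeVectorWritable reused = new FakeVectorWritable();
        double[][] rows = { {2, 1, 1, 0}, {0, 1, 0, 2}, {1, 2, 0, 0} };
        for (int key = 0; key < rows.length; key++) {
            reused.set(rows[key]);
            // Prints the same identity string for every key, like the dump above.
            System.out.println("Key: " + key + ": Value: " + reused);
        }
    }
}
```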

Thanks,
Kris



2010/6/10 Jake Mannix <jake.mannix@gmail.com>

> Yeah, you simply can't cast between IntWritable and LongWritable, sadly.
> You need to convert your Long document ids to Integer.  Since you're
> pulling
> documents from Solr, the docIds should be sequential and start small,
> in which case they're all well under Integer.MAX_VALUE, and so a trivial
> MapReduce (well, Map, no Reduce) job with a Mapper like this should work:
>
> public class M extends Mapper<LongWritable, Writable, IntWritable, Writable> {
>   private final IntWritable i = new IntWritable(0);
>
>   @Override
>   public void map(LongWritable key, Writable value, Context c)
>       throws IOException, InterruptedException {
>     i.set((int) key.get());  // the parameter is named "key", not "k"
>     c.write(i, value);       // the new-API Context uses write(), not collect()
>   }
> }
>
> Run that over your input first, and you should be set.
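(A hedged aside on the `(int)` cast above: it silently truncates any id above Integer.MAX_VALUE. If there is any doubt about the id range, a guarded conversion fails loudly instead. Plain Java, no Hadoop needed to test the logic; `IdConversion` is an illustrative name, not a Mahout class.)

```java
public final class IdConversion {
    private IdConversion() {}

    // Convert a long document id to int, refusing values that won't fit
    // rather than letting the cast wrap around silently.
    public static int toIntId(long id) {
        if (id < Integer.MIN_VALUE || id > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Document id out of int range: " + id);
        }
        return (int) id;
    }

    public static void main(String[] args) {
        System.out.println(toIntId(42L));  // prints 42
        // toIntId(1L << 40) would throw IllegalArgumentException
    }
}
```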
>
>  -jake
>
> On Thu, Jun 10, 2010 at 7:20 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
>
> > Got a little further by making some more class changes...
> >
> > //
> > public class GenSimMatrixJob extends AbstractJob {
> >
> >    public GenSimMatrixJob() {
> >    }
> >
> >    @Override
> >    public int run(String[] strings) throws Exception {
> >        addOption("numDocs", "nd", "Number of documents in the input");
> >        addOption("numTerms", "nt", "Number of terms in the input");
> >
> >        Map<String,String> parsedArgs = parseArguments(strings);
> >        if (parsedArgs == null) {
> >          return -1;  // argument parsing failed; non-zero signals an error
> >        }
> >
> >        Configuration originalConf = getConf();
> >        String inputPathString = originalConf.get("mapred.input.dir");
> >        String outputTmpPathString = parsedArgs.get("--tempDir");
> >        int numDocs = Integer.parseInt(parsedArgs.get("--numDocs"));
> >        int numTerms = Integer.parseInt(parsedArgs.get("--numTerms"));
> >
> >        DistributedRowMatrix text = new DistributedRowMatrix(inputPathString,
> >                outputTmpPathString, numDocs, numTerms);
> >        text.configure(new JobConf(getConf()));
> >
> >        DistributedRowMatrix transpose = text.transpose();
> >        DistributedRowMatrix similarity = transpose.times(transpose);
> >
> >        System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> >        return 0;  // 0 signals success to ToolRunner
> >    }
> >
> >    public static void main(String[] args) throws Exception {
> >        ToolRunner.run(new GenSimMatrixJob(), args);
> >    }
> > }
> > //
> >
> > Giving the error...
> >
> > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Running job: job_local_0001
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.util.NativeCodeLoader <clinit>
> > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 10-Jun-2010 15:16:28 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > INFO: Got brand-new decompressor
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask runOldMapper
> > INFO: numReduceTasks: 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: io.sort.mb = 100
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: data buffer = 79691776/99614720
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: record buffer = 262144/327680
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > WARNING: job_local_0001
> > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> >    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO:  map 0% reduce 0%
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Job complete: job_local_0001
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.Counters log
> > INFO: Counters: 0
> >
> >
> >
> > 2010/6/10 Kris Jack <mrkrisjack@gmail.com>
> >
> > > In the attempt to create a document-document similarity matrix, I am
> > > getting the following error:
> > >
> > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Running job: job_local_0001
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.util.NativeCodeLoader <clinit>
> > > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > > INFO: Got brand-new decompressor
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask runOldMapper
> > > INFO: numReduceTasks: 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: io.sort.mb = 100
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: data buffer = 79691776/99614720
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: record buffer = 262144/327680
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > > WARNING: job_local_0001
> > > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> > >     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> > >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO:  map 0% reduce 0%
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Job complete: job_local_0001
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.Counters log
> > > INFO: Counters: 0
> > > Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Job failed!
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:163)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.generateMatrix(GenSimMatrixLocal.java:24)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.main(GenSimMatrixLocal.java:34)
> > > Caused by: java.io.IOException: Job failed!
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:158)
> > >     ... 2 more
> > >
> > >
> > > I created a test solr index with 3 documents and generated a sparse feature
> > > matrix out of it using mahout's org.apache.mahout.utils.vectors.lucene.Driver.
> > >
> > > I then ran the following code using the sparse feature matrix as input
> > > (mahoutIndexTFIDF.vec).
> > >
> > > {
> > >     private void generateMatrix() {
> > >         String inputPath = "/home/kris/data/mahoutIndexTFIDF.vec";
> > >         String tmpPath = "/tmp/matrixMultiplySpace";
> > >         int numDocuments = 3;
> > >         int numTerms = 4;
> > >
> > >         DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> > >           tmpPath, numDocuments, numTerms);
> > >
> > >         JobConf conf = new JobConf("similarity job");
> > >         text.configure(conf);
> > >
> > >         DistributedRowMatrix transpose = text.transpose();
> > >
> > >         DistributedRowMatrix similarity = transpose.times(transpose);
> > >
> > >         System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> > >     }
> > >
> > >     public static void main (String [] args) {
> > >         GenSimMatrixLocal similarity = new GenSimMatrixLocal();
> > >
> > >         similarity.generateMatrix();
> > >     }
> > > }
> > >
> > > Anyone see why there is a problem between LongWritable and IntWritable
> > > casting?  Does it need to be configured differently?
> > >
> > > Thanks,
> > > Kris
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/
> >
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
