mahout-user mailing list archives

From sushil_kb <bajracha...@gmail.com>
Subject Re: Creating Vectors from Text
Date Wed, 28 Oct 2009 01:08:11 GMT

It seems that the problem is that not all of the documents in my index have
the field I am using to get term vectors from. I made the following changes
to work around this, but I am not sure whether that's the right approach. I
wanted to get this working so I could run LDA topic modeling on the output
from the Driver.

Index:
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===================================================================
---
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
(revision 830343)
+++
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
(working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if (point != null) writer.append(new LongWritable(recNum++), point);
 
     }
     return recNum;
Index:
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===================================================================
---
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
(revision 830343)
+++
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
(working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+
+        if (result == null) {
+          return null;
+        }
         if (idField != null) {
           String id = indexReader.document(doc,
idFieldSelector).get(idField);
           result.setName(id);
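
Taken together, the two hunks amount to this: LuceneIterable returns null for a document that has no stored term vector on the requested field, and SequenceFileVectorWriter skips null points instead of appending them. Below is a minimal, self-contained sketch of that behavior; the class and method names are illustrative, not the real Mahout API:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the patched write loop: documents whose field carries no
// stored term vector come back as null, and the writer skips them
// instead of hitting a NullPointerException in append().
public class NullSafeVectorWriter {

  // Writes each non-null vector; returns how many records were written.
  static long write(Iterable<double[]> vectors) {
    long recNum = 0;
    for (double[] point : vectors) {
      if (point != null) {  // the guard added in the patch
        // real code: writer.append(new LongWritable(recNum), point);
        recNum++;
      }
    }
    return recNum;
  }

  public static void main(String[] args) {
    // two documents with term vectors, one without (null)
    List<double[]> vectors =
        Arrays.asList(new double[] {1.0, 2.0}, null, new double[] {3.0});
    System.out.println(write(vectors)); // prints 2
  }
}
```

Note that skipping documents this way silently changes the record count, so the sequence-file keys no longer line up one-to-one with Lucene document ids.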





sushil_kb wrote:
> 
> I am having the same problem as Allan. I checked out Mahout from trunk,
> tried to create a term-frequency vector from a Lucene index, and ran into
> this:
> 
> 09/10/27 17:36:10 INFO lucene.Driver: Output File:
> /Users/shoeseal/DATA/luc2tvec.out
> 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
> 	at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
> 	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
> 
> I am running this from Eclipse (Snow Leopard with JDK 6), on an index that
> has a field with stored term vectors.
> 
> my input parameters for Driver are:
> --dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
> --field field_with_TV --dictOut <path>/luc2tvec.dict --max 50 --weight tf
> 
> Luke shows the following info on the fields I am using:
>  id_field is indexed, stored, omit norms
>  field_with_TV is indexed, tokenized, stored, term vector
> 
> I can run the LuceneIterableTest test fine, but when I run the Driver on my
> own index I get into trouble. Are there any possible causes for this behavior
> besides not having an index field with a stored term vector?
> 
> Thanks.
> - sushil
> 
> 
> 
> 
> Grant Ingersoll-6 wrote:
>> 
>> 
>> On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:
>> 
>>> Regards,
>>> This is the entire exception message:
>>>
>>>
>>> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>>> /home/hadoop/Desktop/<urls>/index  --field content  --dictOut
>>> /home/hadoop/Desktop/dictionary/dict.txt --output
>>> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>>>
>>>
>>> 09/07/02 09:35:47 INFO vectors.Driver: Output File:
>>> /home/hadoop/Desktop/dictionary/out.txt
>>> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>> library
>>> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded &  
>>> initialized
>>> native-zlib library
>>> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
>>> Exception in thread "main" java.lang.NullPointerException
>>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>>>        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>>>        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>>
>>>
>>> Well, I used a Nutch crawl index; is that correct? Hmm... I have
>>> changed to the content field, but nothing happened.
>>> Possibly the Nutch crawl doesn't have term vectors indexed.
>> 
>> This would be my guess.  A small edit to Nutch code would probably  
>> allow it.  Just find where it creates a new Field and add in the TV  
>> stuff.
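>> 
>> Concretely, the edit Grant describes looks something like the fragment
>> below, using the Lucene 2.x/3.x Field constructor. This is a sketch, not
>> a patch against Nutch: the field name "content" and the pageText variable
>> are illustrative, and the real change goes wherever Nutch builds its
>> Lucene Document.
>> 
```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: enable stored term vectors at the point where the field is created.
public class TermVectorFieldExample {
  static Document buildDoc(String pageText) {
    Document doc = new Document();
    doc.add(new Field("content", pageText,
        Field.Store.YES,
        Field.Index.ANALYZED,       // Field.Index.TOKENIZED in Lucene < 2.4
        Field.TermVector.YES));     // without this, the field has no term vectors
    return doc;
  }
}
```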
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Creating-Vectors-from-Text-tp24298643p26087765.html
Sent from the Mahout User List mailing list archive at Nabble.com.

