Well, I got nowhere trying to index OpenOffice documents, so I thought I'd try
indexing PDF documents. PDFBox seemed like a good bet: it claims to offer
Lucene support and is on the Lucene recommended list. But after numerous
failed attempts I decided to try the IndexFiles.java that comes with PDFBox, and
I get the same error that my modified Lucene demo code gets.
C:\PDFBox-0.7.3\classes>java org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\index c:\test
root=c:\test
Skipping c:\test\HTMLParser.java
Skipping c:\test\SearchFiles.java
Indexing PDF document: c:\test\doc.pdf
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
    at org.pdfbox.searchengine.lucene.IndexFiles.addDocument(IndexFiles.java:295)
    at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:269)
    at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:236)
    at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:223)
    at org.pdfbox.searchengine.lucene.IndexFiles.index(IndexFiles.java:165)
    at org.pdfbox.searchengine.lucene.IndexFiles.main(IndexFiles.java:140)
This is quite curious, since my code to index text documents does this
successfully:
/*
 * Add title
 */
document.add(new Field("title", title, Field.Store.YES, Field.Index.UN_TOKENIZED));
And looking at the failing PDFBox code, it is doing the exact same thing:
document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );
Very strange, since the exception is a NoSuchMethodError for Document.add(Field):
my custom code calling doc.add(Field) works, but PDFBox's code calling
doc.add(Field) does not.
As a classpath check I tried this:
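import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexMain
{
    public void indexDoc(String filename, String title, String objectId,
                         String nodeId) throws Exception
    {
        File INDEX_DIR = new File("index");
        KcmiDocument kcmiDoc = null;   // my own class; builds a Document via doc.add(Field)
        Document pdfDocument = null;
        LucenePDFDocument lpdf = new LucenePDFDocument();
        IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer());
        File file = new File(filename);
        // PDFs go through PDFBox; everything else goes through my own code
        if (filename.endsWith("pdf"))
            pdfDocument = lpdf.getDocument(file);
        else
            kcmiDoc = new KcmiDocument(objectId, title);
    }
}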
Here KcmiDocument does the doc.add(Field) for text files and lpdf.getDocument
does the doc.add(Field) for PDF files.
When I send in a .txt file all is well; when I send in a .pdf file the
exception is thrown.
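A NoSuchMethodError at run time generally means the calling class was compiled against a different version of a class than the one actually loaded. A minimal sketch of one more classpath check, using only standard reflection: it prints where a class was loaded from and every public method of a given name. ClasspathCheck, locationOf and signaturesOf are names made up for this illustration; the default class and method names are taken from the stack trace above. If the Lucene jar on the classpath lists add with a parameter type other than Field, that would be consistent with PDFBox having been built against a different Lucene release than the one I compile my own code with.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or PDFBox.
public class ClasspathCheck {

    /** Returns the jar or directory a class was loaded from, or "(bootstrap)" if unknown. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap)";
        }
        return src.getLocation().toString();
    }

    /** Returns the full signatures of all public methods with the given name. */
    public static List<String> signaturesOf(Class<?> cls, String methodName) {
        List<String> result = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                result.add(m.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Defaults come from the stack trace; override them on the command line.
        String className = args.length > 0 ? args[0] : "org.apache.lucene.document.Document";
        String methodName = args.length > 1 ? args[1] : "add";

        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        for (String sig : signaturesOf(cls, methodName)) {
            System.out.println("  " + sig);
        }
    }
}
```

Run it with exactly the same classpath as the failing IndexFiles command, so that it sees the same Lucene jar that PDFBox sees.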
If anyone knows what I am doing wrong, or knows of another easy method to extract
text from a PDF file, I would certainly like to know. I can live without
OpenOffice (for a while), but not being able to index PDFs would be a Lucene
show stopper.
thanks
jim s