nutch-dev mailing list archives

From Edward Quick <edwardqu...@hotmail.com>
Subject problems parsing pdf's
Date Sun, 07 Sep 2008 20:59:51 GMT







Hi,

I keep getting the following errors when parsing PDFs:

Error parsing: http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/DeT+three+wishes/$FILE/Three+wishes.pdf:
failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary

fetch of http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Uniform+Wearers+Guide/$FILE/BAUWS.pdf
failed with: java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage

I have applied the patch mentioned here:
https://issues.apache.org/jira/browse/NUTCH-643
but it didn't stop the ClassCastExceptions for all of the documents.

Currently I've got about 243 PDFs on our intranet which I can't get Nutch to parse :-(

Cheers,

Ed.
