tika-dev mailing list archives

From Phani Kumar Samudrala <phanikuma...@arisglobal.co.in>
Subject RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
Date Tue, 12 Feb 2013 10:25:45 GMT
Sorry, I just realized that I seem to have posted this to the wrong mailing list. Please ignore it.

-----Original Message-----
From: Phani Kumar Samudrala [mailto:phanikumar.s@arisglobal.co.in]
Sent: Tuesday, February 12, 2013 3:53 PM
To: dev@tika.apache.org
Subject: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using the Tika 1.2 Java API to extract text from a PDF, and I am getting the following exception.
The error occurs only for some PDF documents; other PDFs parse fine.
I couldn't figure out the reason for this. With Tika 1.1, the same file parses fine. Please
let me know if any of you have seen this error and how to fix it.

Here is the exception:


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
      at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
      at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 3 more
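
Since the failure only shows up for some PDFs and the real cause is wrapped inside a TikaException, it can help to log the innermost exception for each failing file. The following is only a minimal sketch and not part of the original report: the class name RootCause and the method rootCause are hypothetical, it uses plain JDK classes, and the two exceptions in main are stand-ins for the wrapped ClassCastException shown in the trace above.

```java
public class RootCause {

    // Walks the cause chain and returns the innermost Throwable.
    // Hypothetical helper; assumes the chain is finite (no cycles).
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the wrapped exception pair in the trace above.
        Throwable inner = new ClassCastException(
                "COSString cannot be cast to COSDictionary");
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser", inner);

        // Prints the class name of the innermost cause.
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```

In the calling code, something like this could be invoked from a catch block around parser.parse(...) so that each failing file is logged together with its underlying cause.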


Here is the code snippet in JAVA:


String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
File file = new File(fileString);
URL url = file.toURI().toURL();

ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
Metadata metadata = new Metadata();
context.set(Parser.class, parser); //PPt,word,xlsx--pdf,html
ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);

input.close();
outputstream.close();


Thanks

________________________________


Disclaimer: This transmission, including attachments, is confidential, proprietary, and may
be privileged. It is intended solely for the intended recipient. If you are not the intended
recipient, you have received this transmission in error and you are hereby advised that any
review, disclosure, copying, distribution, or use of this transmission, or any of the information
included therein, is unauthorized and strictly prohibited. If you have received this transmission
in error, please immediately notify the sender by reply and permanently delete all copies
of this transmission and its attachments.



