tika-dev mailing list archives

From "Qian Diao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
Date Wed, 27 Mar 2013 21:35:15 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615781#comment-13615781 ]

Qian Diao commented on TIKA-1098:
---------------------------------

Here is the stack trace:

org.apache.tika.exception.TikaException: Unable to extract PDF content
url_1763_approx-alg-notes.pdf
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:80)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:140)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.cisco.nc.autovocab.Test.parseFile(Test.java:36)
    at com.cisco.nc.autovocab.Test.main(Test.java:70)
Caused by: java.io.IOException: Error: Unknown annotation type null
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
    at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:797)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:142)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
    ... 6 more
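
The "Caused by" section of the trace above carries the actual failure (the IOException raised in PDFBox), while the outer TikaException only wraps it. As a minimal sketch, assuming one only wants to log the innermost message, the cause chain can be walked with plain JDK calls. The class name RootCause and the helper rootCause are hypothetical names introduced for illustration, and the two exceptions in main are stand-ins that mimic the messages in the trace; they are not produced by Tika or PDFBox here.

```java
import java.io.IOException;

public class RootCause {

    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referential cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in exceptions mirroring the messages in the trace (no Tika or PDFBox needed).
        Exception inner = new IOException("Error: Unknown annotation type null");
        Exception outer = new RuntimeException("Unable to extract PDF content", inner);

        // Prints the message of the innermost exception.
        System.out.println(rootCause(outer).getMessage());
    }
}
```

In a catch block like the one in the quoted Test.java, calling e.printStackTrace() instead of System.out.println(e) would also expose the full chain, since println(e) shows only the outermost exception's class and message.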
                
> not able to parse pdfs/docs/ppts using 1.1 tika parser
> --------------------------------------------------------
>
>                 Key: TIKA-1098
>                 URL: https://issues.apache.org/jira/browse/TIKA-1098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>         Environment: linux redhat
>            Reporter: Qian Diao
>         Attachments: url_1763_approx-alg-notes.pdf
>
>
> Hi,
> I got some parsing problems when using Tika 1.1 for the attached pdf file.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>         public String parseFile(File inFile) {
>             if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;
>                    
>             InputStream is = null;
>             String outputText = "";
>             try {
>                 // Open input stream
>                 is = new FileInputStream(inFile);
>                 // Prepare parser
>                 BodyContentHandler contenthandler = new BodyContentHandler(-1);
>                 Metadata metadata = new Metadata();
>                 metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>                 ParseContext pc = new ParseContext();
>                 // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
>                 if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                         Parser parser = new AutoDetectParser();
>                         parser.parse(is, contenthandler, metadata, pc);
>                 }
>                 else {
>                         Parser parser = new HtmlParser();
>                         BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                         parser.parse(is, bh, metadata, pc);
>                 }
>                 // Prepare text for write
>                 outputText = contenthandler.toString();        
>             } catch (Exception e) {
>                 System.out.println(e);
>                 return null;
>             } finally {
>                 try { 
>                     if (is != null) 
>                         is.close(); 
>                 } catch (Exception e) {}
>             }
>            
>             return outputText;
>         }
> =====output====
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
