tika-dev mailing list archives

From "Qian Diao (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1097) not able to parse pdfs/docs/ppts using 1.1 and 1.3 tika parser
Date Mon, 25 Mar 2013 21:03:15 GMT
Qian Diao created TIKA-1097:
-------------------------------

             Summary: not able to parse pdfs/docs/ppts using 1.1 and 1.3 tika parser
                 Key: TIKA-1097
                 URL: https://issues.apache.org/jira/browse/TIKA-1097
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3, 1.1
         Environment: linux redhat 
            Reporter: Qian Diao
             Fix For: 1.3, 1.1


Hi,

I ran into parsing problems with Tika 1.1: some PDF, DOC and PPT files were not parsed.
I then tried 1.3, but some PDF, DOC and PPT files still cannot be parsed.

My code (Test.java):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Test {

    // File names with these extensions are treated as HTML and run through boilerpipe.
    private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    public String parseFile(File inFile) {
        if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;

        InputStream is = null;
        String outputText = "";
        try {
            // Open input stream
            is = new FileInputStream(inFile);

            // Prepare parser; -1 disables the write limit on the extracted text
            BodyContentHandler contenthandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
            ParseContext pc = new ParseContext();

            // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
            if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                Parser parser = new AutoDetectParser();
                parser.parse(is, contenthandler, metadata, pc);
            } else {
                Parser parser = new HtmlParser();
                BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                parser.parse(is, bh, metadata, pc);
            }

            // Prepare text for write
            outputText = contenthandler.toString();
        } catch (Exception e) {
            // Prints only the outer exception, not its cause
            System.out.println(e);
            return null;
        } finally {
            try {
                if (is != null)
                    is.close();
            } catch (Exception e) {}
        }

        return outputText;
    }
}
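To make the branch selection in parseFile concrete, here is a minimal sketch that applies the same pattern to a few file names, including two from this report. The class name ExtensionCheck and the method usesBoilerpipe are hypothetical names introduced only for illustration; the sketch uses java.lang.String.matches and nothing from Tika.

```java
final class ExtensionCheck {
    // Same pattern as validBoilerpipeFilenameRegEx in Test.java
    static final String HTML_LIKE = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    // Hypothetical helper: true when parseFile would take the HtmlParser/boilerpipe branch
    static boolean usesBoilerpipe(String fileName) {
        // String.matches requires the whole name to match, and the match is case-sensitive
        return fileName.matches(HTML_LIKE);
    }

    public static void main(String[] args) {
        String[] names = {
            "index.html",
            "url_5300_sudoku2.pdf?referrer=webcluster&",
            "url_5250_linked%20list.ppt"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (usesBoilerpipe(name) ? "HtmlParser + boilerpipe" : "AutoDetectParser"));
        }
    }
}
```

Under this pattern, names ending in .pdf, .doc or .ppt (with or without a trailing query string) do not match, so those files go through the AutoDetectParser branch.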

======

Output (each exception is followed by the name of the file that triggered it):

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@3a6ac461
url_4080_ETS11_TAGMatrix_rev070111.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2b03be0
url_2275_Paper26Pages253-269.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f9a32e0
url_5889_viz.96.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4e513d61
url_1556_sensys_awoo03.pdf

org.apache.tika.exception.TikaException: Unable to extract PDF content
url_1763_approx-alg-notes.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@426295eb
url_5300_sudoku2.pdf?referrer=webcluster&

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7c2e1f1f
url_1441_ChoosingYourFirstCSCourse2011-FINAL.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7eda18ac
url_4272_20080218121324_723.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f0ffb38
url_2491_2106_crime_scene.doc

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4cedf389
url_5227_Romano-Library%20Research%20Series%20-%20March%2029%202007%20FINAL(small).ppt

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6126f827
url_5250_linked%20list.ppt

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3749eb9f
url_2011_undergrad-brochure.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3a289d2e
url_5709_final_presentation_bak.ppt

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ddc0e7a
url_5319_2011_2012_advising_guidelines.pdf

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@7dc5ddc9
url_3502_TheEvolvingRoleTech.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4963f7a1
url_2403_class_presentation_Btree.ppt

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7ba85d38
url_4040_fukunaga_jair07_bin.pdf

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6a8046f4
url_2472_COP3530OverheadsF99.doc
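
The messages above show only the outer TikaException, because the catch block prints the exception object rather than its stack trace. These wrapper exceptions should carry the original error as their cause, so walking the cause chain is one way to see what actually failed. A minimal sketch, assuming a hypothetical helper class CauseChain; it uses only Throwable.getCause() from the JDK, and the exceptions in main are stand-ins, not real Tika output:

```java
import java.util.ArrayList;
import java.util.List;

final class CauseChain {
    // Hypothetical helper: one line per exception in the chain, outermost first
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        // Bounded loop so a cyclic cause chain cannot spin forever
        while (t != null && depth < 32) {
            lines.add((depth == 0 ? "" : "caused by: ") + t);
            t = t.getCause();
            depth++;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped parser failure; the inner exception is invented for illustration
        Exception wrapped = new Exception("Unexpected RuntimeException from parser",
                new IllegalStateException("stand-in root cause"));
        for (String line : describe(wrapped)) {
            System.out.println(line);
        }
    }
}
```

In Test.java, calling such a helper (or simply e.printStackTrace()) inside the catch block would add the underlying error for each failing file to the report.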



Thanks,
Qian
