tika-dev mailing list archives

From Ehsan Sadeghi <esadegh...@gmail.com>
Subject PDF text extraction problems
Date Fri, 04 Jun 2010 09:51:07 GMT
Hello,
I sent this to the Tika Linked before and got an answer from Jukka
Zitting:

It may be that the PDFBox library Tika uses for handling PDF documents is
having a problem with parsing your files. Do you have an example file that
you can share?

BR,


So here is the original mail with the attachments.
PDF file 1:
https://docs.google.com/fileview?id=0B2X-v8a_ekanYmMyMzg1NTktMmFlMi00YjU2LTk2OWQtMTg2NTI1YWI4NTZh&hl=en

PDF file 2:
https://docs.google.com/fileview?id=0B2X-v8a_ekanMTUyNjExMjUtMTI5Yy00NDc4LTg0YmYtODg4NmNkMGIxMmZk&hl=en


I'm trying to parse a PDF file. I first tried this code:

          // the document to be parsed
          InputStream input = new FileInputStream(new File(resourceLocation));
          ContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          PDFParser parser = new PDFParser();
          ParseContext context = new ParseContext();
          parser.parse(input, textHandler, metadata, context);
          input.close();

Then I tried the Tika class:

        Tika tika = new Tika();
        InputStream input = new FileInputStream(new File(resourceLocation));
        Metadata metadata = new Metadata();
        String content = tika.parseToString(input, metadata);

Both snippets behave the same way: they read some of the text in the PDF
file but leave the rest out. I tested this with a 1 MB file and a 100 KB
file.
I looked around and found the message "Tika maxStringLength limit reached"
in the Tika mail archives, where it was suggested that one could raise
maxStringLength like this:
  tika.setMaxStringLength(10 * 1024 * 1024);

That made no difference. Am I doing something wrong? How can I parse the
entire file?
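
One way to narrow this down might be to compare the length of the extracted
text with the limit in force. The sketch below is only an illustration: the
class and method names are hypothetical, the 100,000-character default is an
assumption about Tika's out-of-the-box limit, and the sample string stands in
for whatever tika.parseToString(...) returns.

```java
public class TruncationCheck {
    // Assumption: the default cap on extracted text is 100,000 characters.
    static final int ASSUMED_DEFAULT_LIMIT = 100 * 1000;

    // The raised limit passed to setMaxStringLength in the mail above.
    static final int RAISED_LIMIT = 10 * 1024 * 1024;

    /**
     * Returns true when the extracted text is at least as long as the limit,
     * which would suggest that the limit cut the output short.
     */
    static boolean likelyHitLimit(String extracted, int limit) {
        return extracted.length() >= limit;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by the parser.
        String content = "only part of the document text";

        System.out.println("extracted characters: " + content.length());
        System.out.println("reached assumed default limit: "
                + likelyHitLimit(content, ASSUMED_DEFAULT_LIMIT));
        System.out.println("reached raised limit: "
                + likelyHitLimit(content, RAISED_LIMIT));
    }
}
```

If the extracted text turns out to be much shorter than either limit, the
limit is probably not what cuts the output off, and the PDF parsing itself
would be the next thing to look at.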

cheers
ehsan
