tika-dev mailing list archives

From Ehsan Sadeghi <esadegh...@gmail.com>
Subject PDF text extraction problems
Date Fri, 04 Jun 2010 09:51:07 GMT
I sent this to the Tika Linked before and got this answer from Jukka:

It may be that the PDFBox library Tika uses for handling PDF documents is
having a problem with parsing your files. Do you have an example file that
you can share?


So here is the original mail and attachment.
PDF file 1:


I'm trying to parse a PDF file. I first tried this code:

          // the document to be parsed
          InputStream input = new FileInputStream(new File(resourceLocation));
          ContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          PDFParser parser = new PDFParser();
          ParseContext context = new ParseContext();
          parser.parse(input, textHandler, metadata, context);

Then I tried the Tika class:

        Tika tika = new Tika();
        InputStream input = new FileInputStream(new File(resourceLocation));
        Metadata metadata = new Metadata();
        String content = tika.parseToString(input, metadata);

Both snippets do the exact same thing: they read some of the text in
the PDF file but leave the rest of the file out. I tested this with a 1 MB
file and a 100 KB file.
I looked around and found this message in the Tika mails, "Tika
maxStringLength limit reached", where it was suggested that one could raise
the maxStringLength by doing this:

No result. Am I doing something wrong? How can I parse the entire file?
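One way to tell whether the missing text comes from a length limit rather than from a parsing failure is to look at the length of the extracted string. The sketch below is only a rough diagnostic. It assumes the default limit is 100,000 characters, which as far as I know is what BodyContentHandler and Tika.parseToString use by default. TruncationCheck and looksTruncated are hypothetical names and not part of Tika.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: guesses whether extracted text
// was cut off at a fixed character limit.
final class TruncationCheck {

    // Assumption: the default write limit is 100,000 characters.
    static final int ASSUMED_DEFAULT_LIMIT = 100000;

    // Returns true if the text is at least as long as the limit, which
    // suggests the output stopped because the limit was reached.
    // A negative limit is treated as "no limit", so it never reports truncation.
    static boolean looksTruncated(String extracted, int limit) {
        return extracted != null && limit >= 0 && extracted.length() >= limit;
    }

    public static void main(String[] args) {
        // Stand-in strings; in practice this would be the parser output.
        String shortText = "only a few pages of text";
        char[] filler = new char[ASSUMED_DEFAULT_LIMIT];
        Arrays.fill(filler, 'x');
        String longText = new String(filler);

        System.out.println(looksTruncated(shortText, ASSUMED_DEFAULT_LIMIT));
        System.out.println(looksTruncated(longText, ASSUMED_DEFAULT_LIMIT));
    }
}
```

If the check comes back false, the limit is probably not the cause and the PDF parser itself is the more likely suspect. If it comes back true, passing -1 as the write limit to the BodyContentHandler constructor, or calling tika.setMaxStringLength(-1), should lift the limit, assuming a Tika version that has those methods.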

