tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Parsing incomplete PDF and Office files
Date Thu, 13 Nov 2008 21:52:23 GMT

On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <for.milos@gmail.com> wrote:
> I would like to download just the first few kilobytes of a PDF (or DOC) file
> and extract the text from that portion. I do not want to download the whole
> file and then parse it, only the truncated first N KB. Is this possible with
> Tika? If not, how should I do it?

That's currently not possible, but AFAIK PDFBox supports page-by-page
streaming, though only for PDF documents whose structure allows it
(not all of them do). It would be nice if Tika could leverage that
functionality in PDFBox.

However, I'm not sure how well that would work with truncated streams.
I guess the reasonable approach would be to stream as much text as can
be parsed, and then fail with a TikaException if the input stream ends
unexpectedly. Your application would then need to be aware of this
error condition and handle it appropriately.
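
To illustrate the pattern I have in mind, here is a rough, self-contained
sketch. It does not use Tika or PDFBox at all: the "document" is a toy format
(a record count followed by UTF strings), and TruncatedInputException is a
hypothetical stand-in for TikaException. The only point is that the parser
hands over text as soon as it is read, so the caller still has the partial
output when the exception arrives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class PartialParseSketch {

    /** Hypothetical stand-in for TikaException; this is not a Tika class. */
    public static class TruncatedInputException extends Exception {
        public TruncatedInputException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Toy "parser": reads a record count followed by that many UTF strings,
     * appending each one to the sink as soon as it has been read. If the
     * stream ends early, the text read so far stays in the sink and the
     * method fails with TruncatedInputException.
     */
    public static void parse(InputStream in, StringBuilder sink)
            throws IOException, TruncatedInputException {
        DataInputStream data = new DataInputStream(in);
        try {
            int count = data.readInt();
            for (int i = 0; i < count; i++) {
                sink.append(data.readUTF()).append('\n');
            }
        } catch (EOFException e) {
            throw new TruncatedInputException("input ended unexpectedly", e);
        }
    }

    /** Builds a complete toy document for the demo. */
    public static byte[] encode(String... pages) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(pages.length);
        for (String page : pages) {
            out.writeUTF(page);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Parses only the first n bytes and returns whatever text was recovered. */
    public static String extractFromPrefix(byte[] document, int n) throws IOException {
        byte[] prefix = Arrays.copyOf(document, Math.min(n, document.length));
        StringBuilder text = new StringBuilder();
        try {
            parse(new ByteArrayInputStream(prefix), text);
        } catch (TruncatedInputException e) {
            // Expected for truncated input: report it, but keep the partial text.
            System.err.println("Partial result only: " + e.getMessage());
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = encode("page one", "page two", "page three");
        // The 20-byte prefix cuts the second record in half.
        System.out.print(extractFromPrefix(doc, 20));
    }
}
```

In a real application the catch block is where you would decide whether a
partial result is acceptable or whether the whole document needs to be
fetched after all.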


Jukka Zitting
