tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Parsing incomplete PDF and Office files
Date Fri, 14 Nov 2008 10:48:32 GMT
Hi,

On Fri, Nov 14, 2008 at 8:32 AM, Milos Kovacevic <for.milos@gmail.com> wrote:
> could you please give an example how to parse PDF page-by-page?

You'll want to contact pdfbox-users@incubator.apache.org for that.

I know that PDFBox is able to parse linearized PDF documents (i.e. ones
that are internally stored in page-by-page order), but AFAIK that
streaming capability is currently not used by the higher-level
features like the PDFTextStripper class (even though it already
uses an event model).

BR,

Jukka Zitting
