tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject Re: Parsing incomplete PDF and Office files
Date Fri, 14 Nov 2008 00:22:37 GMT
On a related note, does Tika support full text extraction of PDFs?

On Nov 13, 2008, at 1:52 PM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <for.milos@gmail.com> wrote:
>> I would like to download just a few kilobytes of a PDF(doc) file and to
>> extract the text from it. I do not want to download the whole file and then
>> to parse it, just truncated first N Kbs. Is it possible with Tika or not? If
>> not how should I do that?
>
> That's currently not possible, but AFAIK there is support for
> page-by-page streaming in PDFBox (for PDF documents that support that,
> not all of them do). It would be nice if Tika could leverage that
> functionality in PDFBox.
>
> However, I'm not sure how well that would work with truncated streams.
> I guess the reasonable approach would be to stream as much text as can
> be parsed, and then fail with a TikaException if the input stream ends
> unexpectedly. Your application would then need to be aware of this
> error condition and handle it appropriately.
>
> BR,
>
> Jukka Zitting
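
For illustration only, the "stream what parses, then fail" approach described
above could look roughly like the sketch below. It does not use Tika or PDFBox.
The record format (a one-byte length followed by that many bytes of text) is a
toy stand-in for a real document format, and TruncatedInputError, stream_text
and extract_prefix are hypothetical names, with TruncatedInputError playing the
role of the TikaException mentioned above.

```python
import io


class TruncatedInputError(Exception):
    """Hypothetical stand-in for the TikaException mentioned in the thread."""


def stream_text(stream, on_text):
    """Toy parser, not a real PDF parser.

    Each record is a 1-byte length followed by that many bytes of UTF-8 text.
    Every complete record is passed to on_text as soon as it is read. If the
    stream ends in the middle of a record, TruncatedInputError is raised.
    """
    while True:
        header = stream.read(1)
        if not header:
            return  # clean end of input between records
        length = header[0]
        body = stream.read(length)
        if len(body) < length:
            raise TruncatedInputError("input ended inside a record")
        on_text(body.decode("utf-8"))


def extract_prefix(data):
    """Return (chunks, complete) for input that may have been cut short.

    The caller keeps whatever text was emitted before the failure and
    learns from the flag whether the input was complete.
    """
    chunks = []
    try:
        stream_text(io.BytesIO(data), chunks.append)
        return chunks, True
    except TruncatedInputError:
        return chunks, False


full = bytes([5]) + b"hello" + bytes([5]) + b"world"
print(extract_prefix(full))      # both records parsed
print(extract_prefix(full[:9]))  # second record cut off after two bytes
```

The point of the design is that text is handed to the caller as it is parsed,
so an unexpected end of input loses only the unfinished record. The application
still has to treat the error as "partial result" rather than "no result".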

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/


