xmlgraphics-fop-users mailing list archives

From Abel Braaksma <abel.onl...@xs4all.nl>
Subject Re: Tools for reverse FOP?
Date Thu, 01 Nov 2007 11:15:30 GMT
siegfried wrote:
> Are there any tools that will accept a PDF and produce XML? Might this 
> be a feature of FOP someday?
> Thanks,
> Siegfried

That's highly improbable, because PDF is essentially an unstructured 
format, and going from unstructured to structured is a daunting task 
(often impossible, both in theory and in practice).

There are tools that extract the text from a PDF, and there are tools 
that extract the images. Some also create Word (iirc) and/or RTF output 
that preserves the layout. Going from RTF to XSL-FO is then rather easy 
(RTF is text-based), but the result gets extremely bloated (check out 
the RTF when you have all options set: it gets huge even for a couple 
of pages!). Much of this has to do with the precise positioning inside 
PDF. Even so, many objects or properties cannot be extracted at all 
(borders, backgrounds, alpha channels, overlays, partially embedded fonts).
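
To give an idea of what the "easy" half looks like, here is a minimal 
sketch in Python. It assumes you already have plain text from some PDF 
text extraction tool (that step is not shown), and it wraps each 
paragraph in an fo:block. The function names, the A4 page size and the 
"space-after" value are my own illustrative choices, not anything FOP 
prescribes. Note that everything listed above (borders, backgrounds, 
fonts, positioning) is simply absent from the result.

```python
import re
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of an XSL-FO element."""
    return "{%s}%s" % (FO_NS, tag)


def split_paragraphs(text):
    """Split extracted text on blank lines and collapse inner whitespace."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def text_to_fo(paragraphs):
    """Build a minimal XSL-FO document with one fo:block per paragraph."""
    root = ET.Element(fo("root"))

    # One simple A4 page master (hypothetical choice of page size).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(
        masters,
        fo("simple-page-master"),
        {"master-name": "page", "page-height": "29.7cm", "page-width": "21cm"},
    )
    ET.SubElement(page, fo("region-body"))

    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    for paragraph in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = paragraph  # ElementTree escapes <, > and & on output

    return ET.tostring(root, encoding="unicode")
```

Calling text_to_fo(split_paragraphs(extracted_text)) returns a string 
that can be saved as a .fo file. Whether the paragraphs come out right 
depends entirely on how well the extraction tool kept the blank lines.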

I don't see a reason why FOP would do such a thing. If PDF were treated 
as input, then Word, RTF, TIFF, BMP etc. should also be considered, I 
guess, which makes it next to impossible. It is such a specialized task 
(compare OCR) that other tools are better suited.

Hope this answers your question,

-- Abel Braaksma
