xmlgraphics-fop-users mailing list archives

From: Jeremias Maerki <...@jeremias-maerki.ch>
Subject: Re: AW: Memory Leak issue -- FOP
Date: Fri, 10 Sep 2010 08:50:12 GMT
On 10.09.2010 10:13:43 Craig Ringer wrote:
> On 09/10/2010 03:44 PM, Georg Datterl wrote:
> > Hi Hamed,
> >
> > I did some pretty large publications with lots of images. 1500 pages
> > took 2GB memory, after I put some effort in memory optimization. The
> > only FOP-related issue I found was image caching and that can be
> > disabled. I’m quite sure I would have found a memory leak in FOP,
> > especially one related to ordinary LayoutManagers. So either make your
> > page-sequences shorter or give fop more memory.
> 
> I can't help but wonder if FOP needs to keep the whole page sequence in 
> memory, at least for PDF output. Admittedly I haven't verified that it 
> *is* keeping everything in RAM, but that's certainly a whole lot of RAM 
> for a moderate-sized document.

In theory, it doesn't need to keep whole page-sequences (the FO tree) in
memory. In practice, it currently does. FOP is still waiting for someone
to tackle that problem, which is going to be quite a job.

> I've been meaning to look at how fop is doing its PDF generation for a 
> while, but I've been head-down trying to finish a web-based UI for work 
> first. I do plan to look at it though as I've done a fair bit of work on 
> PDF generation libraries and I'm curious about how Fop is doing it (and 
> how much wheel-reinvention might be going on).
> 
> Anyway, PDF is *designed* for streaming output, so huge PDFs can be 
> produced using only very small amounts of memory with a bit of thought 
> into how the output works. I've had no issues generating 
> multi-hundred-megabyte PDF documents with very small amounts of RAM 
> using PoDoFo, a C++ PDF library that supports direct-to-disk PDF generation.

FOP's PDF library is highly optimized for streaming output and uses very
little RAM. I can easily produce image-heavy photo books with a size of
2GB and as little as 64MB of RAM. What takes a lot of RAM is the FO tree,
for which we currently have no means to release subtrees that have already
been processed; right now, memory is only released for a whole page-sequence
once it has been fully processed. The layout engine itself also takes some
memory. And
finally the area tree, but this one can release single pages once they
are fully resolved and written out. The intermediate format is strictly
streaming and does not use a noticeable amount of memory.
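
To illustrate the output side: when FOP is embedded, the generated PDF goes
straight to whatever OutputStream you hand it, so nothing on the output side
forces the whole document into memory. A minimal embedding sketch (the usual
FopFactory/Transformer setup; newer releases take a base URI in
FopFactory.newInstance(), older ones used the no-argument form):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

public class Fo2Pdf {

    public static void main(String[] args) throws Exception {
        // Newer FOP releases want a base URI here; FOP 1.x used FopFactory.newInstance().
        FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());

        // The PDF is written to this stream as pages are rendered; the PDF
        // library never buffers the whole document in memory.
        OutputStream out = new BufferedOutputStream(new FileOutputStream("big-document.pdf"));
        try {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);

            // Identity transform: feed the FO file into FOP's SAX handler.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new StreamSource(new File("big-document.fo")),
                                  new SAXResult(fop.getDefaultHandler()));
        } finally {
            out.close();
        }
    }
}

The heap consumption you see with large page-sequences comes from the FO tree
and the layout stage in front of this pipeline, not from the PDF writing at
the end.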

> There are all sorts of tricks you can do. The most important is of 
> course that you can make back- or forward- indirect references to almost 
> any object, with no constraints on object order in the document. You can 
> write whatever you generate out very aggressively. You can even split 
> your content stream(s) for each page into multiple segments so you can 
> write the content stream out when it gets too big. Or write the content 
> stream to a tempfile, then merge it into the PDF after the other 
> resources for the page have been written.

Right, these are exactly the kinds of work-arounds that have to be used right now.
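
To make the mechanism concrete: you can hand out object numbers before the
objects exist, write each object to the stream as soon as it is ready,
remember its byte offset, and emit the cross-reference table at the very end.
A toy sketch of that pattern (just the idea, not FOP's actual PDF library
code):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy sketch: stream PDF objects out as they become ready and build the
 *  cross-reference table at the end. Object numbers can be reserved up front,
 *  so a page dictionary may point at content streams that are written later. */
public class PdfObjectStreamer {

    private final OutputStream out;
    private long position;                                  // bytes written so far
    private final List<Long> offsets = new ArrayList<>();   // index = object number - 1

    public PdfObjectStreamer(OutputStream out) throws IOException {
        this.out = out;
        write("%PDF-1.4\n");
    }

    /** Reserve an object number now; the object body can be written later. */
    public int reserve() {
        offsets.add(-1L);                                    // offset not known yet
        return offsets.size();
    }

    /** Write a reserved object to the stream and remember where it starts. */
    public void writeObject(int number, String body) throws IOException {
        offsets.set(number - 1, position);
        write(number + " 0 obj\n" + body + "\nendobj\n");
    }

    /** Emit the cross-reference table and trailer once everything is written. */
    public void finish(int rootObjectNumber) throws IOException {
        long xrefStart = position;
        write("xref\n0 " + (offsets.size() + 1) + "\n");
        write("0000000000 65535 f \n");                      // entry for the free object 0
        for (long offset : offsets) {
            write(String.format("%010d 00000 n \n", offset));
        }
        write("trailer\n<< /Size " + (offsets.size() + 1)
                + " /Root " + rootObjectNumber + " 0 R >>\n");
        write("startxref\n" + xrefStart + "\n%%EOF\n");
        out.flush();
    }

    private void write(String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes);
        position += bytes.length;
    }
}

With that, a page dictionary can be written out immediately with /Contents
pointing at a reserved object number, and the matching content stream object
can follow whenever it is ready.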

> There should be no need for image caching, because once you've written 
> the image object to the PDF once, you can just reference it again in 
> later pages. Not only does that save RAM but it makes your PDF smaller 
> and faster. It works even if your image is used in different sizes, 
> scales, etc in different parts of the document, because you can crop and 
> scale using content-stream instructions.

Within one document, an image is only loaded once, written once and then
re-used, even if there is no image cache. The image cache is only there to
cache images between rendering runs. In addition, FOP is optimized to
"preload" images, i.e. it only extracts the intrinsic size of an image
without actually loading and processing it (where possible). Only the
output format will finally load the image fully and process it for
output.
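
So within one rendering run the re-use is essentially a lookup by URI: the
first occurrence writes the image XObject and remembers its object reference,
later occurrences just emit that reference again. Roughly like this (a sketch
of the idea only; the class and method names here are made up, not FOP's real
ones):

import java.util.HashMap;
import java.util.Map;

/** Sketch only: re-use an already-written image by URI instead of embedding
 *  it again. ImageWriter and PdfReference are hypothetical stand-ins. */
public class ImageObjectPool {

    private final Map<String, PdfReference> writtenImages = new HashMap<>();
    private final ImageWriter writer;

    public ImageObjectPool(ImageWriter writer) {
        this.writer = writer;
    }

    /** Returns the PDF object reference for the image, writing it only on first use. */
    public PdfReference referenceFor(String imageUri) {
        PdfReference ref = writtenImages.get(imageUri);
        if (ref == null) {
            // First occurrence: decode the image and write it as an image XObject.
            ref = writer.writeImageXObject(imageUri);
            writtenImages.put(imageUri, ref);
        }
        return ref;
    }

    /** Hypothetical collaborators, just to make the sketch self-contained. */
    public interface ImageWriter {
        PdfReference writeImageXObject(String imageUri);
    }

    public static final class PdfReference {
        public final int objectNumber;
        public PdfReference(int objectNumber) {
            this.objectNumber = objectNumber;
        }
    }
}

Different display sizes then only differ in the scaling and cropping written
into the page's content stream; the image data itself is written just once.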

> You don't even have to keep the page dictionaries in RAM. You can write 
> them out when the page is done (or before). Because forward-indirect 
> references are permitted, if you have content on the page that's yet to 
> be generated you can reserve some object IDs for those content streams 
> and output indirect references to the as-yet nonexistent content streams 
> in the page dictionary.

That already happens to a certain degree.

> About the only time I can think of when you have to keep something in 
> memory (or at least, in a tempfile) is when you have content in a page 
> (like total page counts) that cannot be generated until later in the 
> document - and may re-flow the rest of the page's content. If the 
> late-generated content won't force a reflow it can just be put in a 
> separate content stream with a forward-reference.
> 
> Admittedly, I'm speaking only about the actual PDF generation. It may 
> well be that generating the AT/IF is inherently demanding of resident 
> RAM, or that the IF/AT don't contain enough information to generate 
> pages progressively.

Not at all. It's all in the FO tree and the layout engine, which can still be
improved.

> The point, though, is that PDF output shouldn't use much RAM if the PDF 
> output code is using PDF features to make it efficient. Sometimes it's a 
> trade-off between how efficient the produced PDF is and how efficient 
> its creation is, but you can always post-process (optimize) a PDF once 
> it's generated if you want to do things like linearize it for fast web 
> loading.

As you've seen, you're suspecting the problem in the wrong part of FOP.
PDF linearization would indeed be impossible right now (without
post-processing), precisely because we stream the PDF for low memory
consumption.


Jeremias Maerki



