lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: For indexing: how to estimate needed memory?
Date Fri, 04 May 2007 13:48:42 GMT
A minor clarification. What I'm measuring is the size of the index
before I index a doc, and the size after (this involves "expert" usage
of my IndexWriter subclass....). Using those two figures, I get
a "bloat" factor. That is, how much the memory was consumed by
this document in the index. Say it's 7x the size of the original doc.

So when a doc comes in to be indexed, I multiply the incoming doc
size x7 x2 and flush if necessary.

I absolutely agree that this is a crude measure and doesn't take
into account memory consumption as that document is

And I *strongly* suspect that the ridiculous factors I'm getting
(the index grows, at times, by 7x the size of the original doc)
are an artifact of some sort of "allocate some extra memory for
structure X" rather than a true measure of how much space that
document uses. Especially since I'm indexing 11G of raw
data in a 4G index. While storing lots of raw data. I'm imagining that
I index, say, a 20K document. And my index grows by 140K. All
because some internal structure say doubled in size. I have no
evidence for this, except that my index isn't 77G. Which suggests
that a better measure would be the total raw byte sizes of
all the documents compared to the total size of the index in
ram. Between flushes. Hmmmmm......

I guess in my situation, I'm indexing only as a batch process. Currently,
we build an index then deploy it then don't change except once in a
great while (months/years). So I can afford crude measurements and
figure out what to fix up when it explodes, since it has no chance
of affecting production systems. Which is a great luxury.....

I agree with all your points, it sounds like you've been around this
block once or twice. As I said, my main motivation is to have a
way to avoid experimenting with the various index factors, merge,etc.
on the *next* project. And because I got interested <G>. No doubt
I'll run into a situation where efficiency really does count, but so far
overnight is fast enough for building my indexes......

After noodling on this for a while, I'm probably going to throw it all
out and just flush when I have less than 100M free. Memory is
cheap and that would accomplish my real goal which is just to
have a class I can use to index the next project that doesn't
require me to fiddle with various factors and gives "good enough"
performance. If the indexing speed is painful, I'll revisit this. But
I suspect that squeezing the use of the last 90M in this case
won't buy me much. Now that I'm thinking of it, it would be
easy to measure how much better using the last 100M by
just comparing the times for building the index with an extra
100M allocated to the JVM. *Then* figure out whether the
speed gain was worth the complexity.....

It's nice to use writing things like this to figure out what I should
*really* be concerned with.


On 5/4/07, david m <> wrote:
> > First, are you sure you're free memory calculation is OK? Why not
> > just use freeMemory?
> I _think_ my calculation is ok - my reasoning:
> Runtime.maxMemory() - Amount of memory that can be given to the JVM -
> based on -Xmx<value>
> Runtime.totalMemory() - Amount of memory currently owned by the JVM
> Runtime.freeMemory() - Amount of unused memory currently owned by the JVM
> The amount of memory currently in use (inUse) = totalMemory() -
> freeMemory()
> The amount we can still get (before hitting -Xmx<value>) = maxMemory() -
> inUse
> And in the absence of -Xms, nothing to say we will be given that much.
> > Perhaps also calling the gc if the avail isn't
> > enough. Although I confess I don't know the innards of the
> > interplay of getting the various memory amounts.....
> I do call the gc - but sparingly. If I've done a flush to reclaim
> memory in hopes of having enough memory for a pending document, then
> I'll call the gc before checking if I now have enough memory
> available. However, I too know little of the gc workings. On the
> assumption that the JRE is smarter at knowing how & when to
> execute gc than I am, I operate on the premise that it is not a good
> practice to routinely be calling the gc.
> > The approach I've been using is to gather some data as I'm indexing
> > to decide whether to flush the indexwriter or not. That is, record the
> > size change that ramSizeInBytes() returns before I start to index
> > a document, record the amount after, and keep the worst
> > ratio around. This got easier when I subclassed IndexWriter and
> > overrode the add methods.
> I agree this gives a conservative measure of the worse case memory
> consumption by an indexed document. But it measures memory _after_
> indexing. My observation is that the peak memory usage occurs _during_
> indexing - so that if the process is low on memory, that is when the
> problem (OutOfMemoryError) will hit. In my mind it is the peak usage
> that really matters.
> If there were a way to record and retrieve peak usage for each
> document, we would be able to see if there is a relationship between
> the peak during indexing and ramSizeInBytes() after indexing. If there
> were a (somewhat) predictable relationship, then I think we'd have a
> more accurate value to decide on a factor to use for avoiding
> OutOfMemeoryErrors.
> > Then I'm requiring that I have 2X the worst case I've seen for the
> > incoming document, and flushing (perhaps gc-ing) if I don't have
> > enough.
> Based on the data I've collected, we've been using 1.5x - 2.0x of
> document size as our value (and made it a configuration parameter).
> > And I think that this is "good enough". What it allows (as does your
> > approach) is letting the usual cases of much smaller than 20M+ files
> > to accumulate and flush reasonably efficiently, and not penalizing
> > my speed by, say, always keeping 250M free or some such.
> Agreed... To me it is a balancing act of avoiding OutOfMemoryErrors
> without unnecessarily throttling throughput in order to keep that 250M
> (or whatever) of memory available for what we think is the unusual
> document - and one that arrives for indexing while available memory
> is relatively low. If it arrives when the indexer isn't busy with other
> documents, then likely not a problem anyway.
> >
> > Keep me posted if you come up with anything really cool!
> Ditto.
> Thanks, david.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message