uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....
Date Mon, 09 Mar 2009 14:51:42 GMT
Thanks, Thilo, good points!

Another fine point below.

Thilo Goetz wrote:
> Just a few more points on this fascinating topic.
> * The JVM internally represents characters as UTF-16.
> This means that any ASCII text will use twice as much
> memory in the JVM as on disk.
> * While reading in the file, you will likely do some
> copying.  Even if you allocate a char[] of the right
> size ahead of time and use that as a buffer to read
> in your file, you'll copy that data when you create
> a string out of it.  So you'll need double the
> amount of the final String memory while reading it
> in.  To the best of my knowledge, there is no way
> around this issue, at least if you want to end up
> with a regular Java string.
> * Strings in the JVM use a char[] internally.  So you
> are not only constrained by the maximum heap size, but
> also by the maximum array size on the particular JVM
> implementation you're using.  This detail is buried
> deep down in your JVM documentation.  I don't know
> what the numbers are nowadays, but they used to be
> quite low in the Java 1.4 days.  This may have changed.
> * On 32-bit Windows, a process may use up to 2GB of
> memory, not 4GB.  Subtract from that the memory that
> the JVM needs, and you get to some number around 1.4GB
> as the maximum JVM heap space you can allocate.
Actually, there seems to be a way to get Windows XP and Server to let
users have 3GB, not 2GB, but you have to change a setting.  See

> So the upshot is that on 32-bit Windows, you can't
> read ASCII files larger than 350MB or so into a
> String.  The number may be a lot smaller,
> depending on your JVM and how clever your implementation
> is.
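For anyone wondering where the 350MB figure comes from, here is a rough sketch of the arithmetic as I read it: about 1.4GB of usable heap, two bytes per character once the text is in the JVM, and two copies of the data while the String is being built. The class and method names are just made up for illustration, and the factors are the ones from the quoted text, not measurements. Newer JVMs with compact strings may store such text more compactly, so treat this as an estimate for the setup described here.

```java
public class HeapEstimate {

    /**
     * Rough upper bound on the size (in bytes) of an ASCII file that can be
     * read into a single String, under the assumptions in the quoted text.
     */
    static long maxAsciiFileBytes(long heapBytes) {
        int bytesPerChar = 2; // UTF-16 in the JVM vs. one byte per ASCII char on disk
        int copies = 2;       // the read buffer plus the String's own copy
        return heapBytes / (bytesPerChar * copies);
    }

    public static void main(String[] args) {
        long heap = 1400L * 1024 * 1024; // ~1.4GB heap, the figure quoted above
        System.out.println(maxAsciiFileBytes(heap) / (1024 * 1024) + " MB"); // prints "350 MB"
    }
}
```

This ignores everything else that lives on the heap, which is why the real number may be a lot smaller.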
> In addition, you want to do some UIMA analysis.
> Consider that this needs space, too.  Depending on
> your analysis, the size of the CAS may easily be
> 10 times the size of your text, or more.
> So read in your large files in chunks no larger than
> 5 MB, is my recommendation.  If you have files that
> big, you're probably not concerned with the fact that
> you may be cutting up a word here and there.  Still,
> you can try to place splits at end-of-sentence
> characters or whitespace.
> --Thilo
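
The chunking Thilo recommends could be sketched roughly like this. It is only an illustration, not UIMA code: `ChunkedReader` and `readChunks` are hypothetical names, and the chunk limit is counted in characters rather than bytes. Each chunk is cut at the last whitespace found in the buffer when more input follows, and falls back to a hard cut if there is no whitespace at all (which could in principle split a surrogate pair).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReader {

    /**
     * Reads all text from the reader as chunks of at most maxChars characters,
     * preferring to split right after the last whitespace in each chunk.
     */
    static List<String> readChunks(Reader in, int maxChars) throws IOException {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        char[] buf = new char[maxChars];
        int filled = 0;      // number of valid chars currently in buf
        boolean eof = false;
        while (true) {
            // Top up the buffer until it is full or the input ends.
            while (!eof && filled < maxChars) {
                int n = in.read(buf, filled, maxChars - filled);
                if (n < 0) {
                    eof = true;
                } else {
                    filled += n;
                }
            }
            if (filled == 0) {
                break; // nothing left to emit
            }
            int cut = filled; // default: emit everything in the buffer
            if (!eof) {
                // More input may follow, so try to end the chunk on whitespace.
                for (int i = filled - 1; i >= 0; i--) {
                    if (Character.isWhitespace(buf[i])) {
                        cut = i + 1;
                        break;
                    }
                }
            }
            chunks.add(new String(buf, 0, cut));
            // Carry the unfinished tail over to the start of the buffer.
            System.arraycopy(buf, cut, buf, 0, filled - cut);
            filled -= cut;
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Tiny limit so the splitting is visible; a real run would use a much larger one.
        for (String chunk : readChunks(new StringReader("aa bb cc dd"), 5)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

In a real pipeline each chunk would presumably be processed as its own document rather than collected in a list, so that only one chunk is held in memory at a time. Splitting at end-of-sentence characters instead of whitespace would only change the test inside the backward scan.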
