uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....
Date Mon, 09 Mar 2009 09:18:34 GMT
Just a few more points on this fascinating topic.

* The JVM internally represents characters as UTF-16.
This means that any ASCII text will use twice as much
memory in the JVM as on disk.

* While reading in the file, you will likely do some
copying.  Even if you allocate a char[] of the right
size ahead of time and use that as a buffer to read
in your file, you'll copy that data when you create
a string out of it.  So you'll need double the
amount of the final String memory while reading it
in.  To the best of my knowledge, there is no way
around this issue, at least if you want to end up
with a regular Java string.
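
A minimal sketch of the copy described above.  The class and
method names are made up for illustration.  It reads into a
preallocated char[] and then builds a String from it; changing
the buffer afterwards leaves the String untouched, which shows
that the String holds its own copy of the characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CopyOnStringCreation {

    // Fills buf from the reader (up to buf.length chars) and
    // returns a String built from the chars that were read.
    static String readIntoString(Reader in, char[] buf) {
        int total = 0;
        try {
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) {
                    break; // end of input
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The String constructor copies the chars out of buf, so
        // buffer and String both occupy memory at this point.
        return new String(buf, 0, total);
    }

    public static void main(String[] args) {
        char[] buf = new char[11];
        String s = readIntoString(new StringReader("hello world"), buf);
        buf[0] = 'J';          // modify the buffer after the fact
        System.out.println(s); // prints "hello world"
    }
}
```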

* Strings in the JVM use a char[] internally.  So you
are not only constrained by the maximum heap size, but
also by the maximum array size on the particular JVM
implementation you're using.  This detail is buried
deep down in your JVM documentation.  I don't know
what the numbers are nowadays, but they used to be
quite low in the Java 1.4 days.  This may have changed.

* On 32-bit Windows, a process may use up to 2GB of
memory, not 4GB.  Subtract from that the memory that
the JVM needs, and you get to some number around 1.4GB
as the maximum JVM heap space you can allocate.

So the upshot is that on 32-bit Windows, you can't
read ASCII files larger than 350MB or so into a String:
1.4GB of heap, halved for UTF-16 and halved again for
the copy, leaves about 350MB.  The number may be a lot
smaller, depending on your JVM and how clever your
implementation is.
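
The 350MB figure can be reproduced as a back-of-the-envelope
calculation from the points above.  This is only a sketch: the
class name is made up, and the 1400MB heap is the rough estimate
from this message, not a measured value.

```java
class HeapBudget {

    // Rough upper bound, in MB, on the size of an ASCII file that
    // can be read into a single String with the given heap size.
    static long maxAsciiFileMb(long heapMb) {
        long bytesPerChar = 2; // UTF-16: one byte on disk becomes a 2-byte char
        long copies = 2;       // the read buffer plus the String made from it
        return heapMb / (bytesPerChar * copies);
    }

    public static void main(String[] args) {
        System.out.println(maxAsciiFileMb(1400)); // prints 350
    }
}
```

This ignores everything else that lives on the heap, so the
real limit will be lower.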

In addition, you want to do some UIMA analysis.
Consider that this needs space, too.  Depending on
your analysis, the size of the CAS may easily be
10 times the size of your text, or more.

So my recommendation is to read in your large files in
chunks no larger than 5 MB.  If you have files that
big, you're probably not concerned with the fact that
you may be cutting up a word here and there.  Still,
you can try to place splits at end-of-sentence
characters or whitespace.
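
One way the chunking could look, as a rough sketch and not a
tested UIMA component.  The class and method names are made up.
It reads at most maxChars characters at a time and, when the
buffer is full, ends the chunk right after the last whitespace
character it contains.  If there is no whitespace at all, it
falls back to a hard split.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ChunkedReader {

    // Splits the reader's content into chunks of at most maxChars
    // chars (maxChars must be >= 1), preferring to cut right after
    // the last whitespace character in a full buffer.
    static List<String> readChunks(Reader in, int maxChars) {
        List<String> chunks = new ArrayList<>();
        char[] buf = new char[maxChars];
        int len = 0;         // number of valid chars in buf
        boolean eof = false;
        try {
            while (true) {
                // Fill the buffer until it is full or the input ends.
                while (!eof && len < maxChars) {
                    int n = in.read(buf, len, maxChars - len);
                    if (n < 0) {
                        eof = true;
                    } else {
                        len += n;
                    }
                }
                if (len == 0) {
                    break; // nothing left to emit
                }
                int cut = len; // default: emit everything in the buffer
                if (len == maxChars) {
                    // Full buffer: look backwards for a whitespace split point.
                    for (int i = len - 1; i >= 0; i--) {
                        if (Character.isWhitespace(buf[i])) {
                            cut = i + 1;
                            break;
                        }
                    }
                }
                chunks.add(new String(buf, 0, cut));
                // Move the unconsumed tail to the front of the buffer.
                System.arraycopy(buf, cut, buf, 0, len - cut);
                len -= cut;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Three chunks: "aaa ", "bbb ", "ccc"
        System.out.println(readChunks(new StringReader("aaa bbb ccc"), 5));
    }
}
```

For ASCII input, a maxChars of 5 * 1024 * 1024 corresponds to the
5 MB suggested above.  Each chunk would then go into its own CAS;
that part is not shown here.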

