uima-dev mailing list archives

From "Thilo Goetz (JIRA)" <uima-...@incubator.apache.org>
Subject [jira] Created: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
Date Fri, 06 Jun 2008 10:22:45 GMT
Remove char heap/ref heap in StringHeap of the CAS

                 Key: UIMA-1067
                 URL: https://issues.apache.org/jira/browse/UIMA-1067
             Project: UIMA
          Issue Type: Improvement
          Components: Core Java Framework
    Affects Versions: 2.2.2
            Reporter: Thilo Goetz
            Assignee: Thilo Goetz
             Fix For: 2.3

The StringHeap class provides two ways to store strings: either as Java strings, or by copying
characters onto a character heap.  The second option is only used for deserialization from
a binary CAS.  However, even when it is not used, this capability carries a significant memory
overhead.  To demonstrate this, I ran the following experiment.  As the analysis engine, I used
our sandbox POS tagger, which sets just one string feature on each token.  As text, I used a
2.4MB input file (2x moby.txt).  To run this in IBM Java 1.5.0_7 (which happens to be the
JVM I'm interested in), you need to specify -Xmx135M; I tested heap sizes in 5MB increments.
Then I patched the StringHeap implementation to work without the additional bookkeeping
overhead and ran the experiment again.  I was then able to run with -Xmx115M, a saving of
about 20MB (roughly 15%).  This is a significant gain, particularly given that I ran so little
analysis (only tokens and sentences are produced, and only a single string-valued feature is
set).  The new code also ran slightly faster, but not by much.  One might see more improvement
for analysis that is not as compute-intensive as the tagger.
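
To make the two storage strategies concrete, here is a minimal sketch in Java.  It is not
the actual UIMA StringHeap code: the class names, the (offset, length) layout of the ref
heap, and the use of code 0 for the null string are all assumptions made for illustration.
The sketch also separates the two strategies into two classes, whereas the issue describes
a single class that pays the bookkeeping cost even when the character heap is unused.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the real UIMA StringHeap. */
public class StringHeapSketch {

    /** Strategy 1: keep Java String objects in a list; the list index is the string code. */
    static final class ListStringHeap {
        private final List<String> strings = new ArrayList<>();

        ListStringHeap() {
            strings.add(null); // assumption: code 0 is reserved for the null string
        }

        int add(String s) {
            strings.add(s);
            return strings.size() - 1;
        }

        String get(int code) {
            return strings.get(code);
        }
    }

    /** Strategy 2: copy characters onto a shared char heap, tracked by a ref heap. */
    static final class CharHeapStringHeap {
        private char[] charHeap = new char[16];
        private int charTop = 0;
        // Assumed layout: two ints per string (offset, length); pair 0 is reserved for null.
        private int[] refHeap = new int[8];
        private int refTop = 2;

        int add(String s) {
            // Grow the char heap until the new string fits.
            while (charTop + s.length() > charHeap.length) {
                charHeap = Arrays.copyOf(charHeap, charHeap.length * 2);
            }
            // Grow the ref heap if there is no room for another (offset, length) pair.
            if (refTop + 2 > refHeap.length) {
                refHeap = Arrays.copyOf(refHeap, refHeap.length * 2);
            }
            s.getChars(0, s.length(), charHeap, charTop);
            refHeap[refTop] = charTop;
            refHeap[refTop + 1] = s.length();
            charTop += s.length();
            int code = refTop / 2;
            refTop += 2;
            return code;
        }

        String get(int code) {
            if (code == 0) {
                return null;
            }
            return new String(charHeap, refHeap[2 * code], refHeap[2 * code + 1]);
        }
    }

    public static void main(String[] args) {
        ListStringHeap listHeap = new ListStringHeap();
        CharHeapStringHeap charHeap = new CharHeapStringHeap();
        for (String tag : new String[] {"NN", "VB", "DT"}) {
            int a = listHeap.add(tag);
            int b = charHeap.add(tag);
            System.out.println(tag + " -> " + listHeap.get(a) + " / " + charHeap.get(b));
        }
    }
}
```

In this sketch the second strategy needs two extra arrays (the char heap and the ref heap)
in addition to the strings themselves, which gives a rough idea of where the bookkeeping
overhead discussed above might come from.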

The challenge is to make sure that the serialization code still works after this change.

