uima-dev mailing list archives

From "Thilo Goetz (JIRA)" <uima-...@incubator.apache.org>
Subject [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
Date Fri, 06 Jun 2008 14:21:45 GMT

     [ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz closed UIMA-1067.

    Resolution: Fixed

Fixed; all unit tests pass.  Please test this change if you use (binary) serialization.  It
should work the same as before, since I haven't changed the serialization format in any way.

> Remove char heap/ref heap in StringHeap of the CAS
> --------------------------------------------------
>                 Key: UIMA-1067
>                 URL: https://issues.apache.org/jira/browse/UIMA-1067
>             Project: UIMA
>          Issue Type: Improvement
>          Components: Core Java Framework
>    Affects Versions: 2.2.2
>            Reporter: Thilo Goetz
>            Assignee: Thilo Goetz
>             Fix For: 2.3
> The StringHeap class provides two ways to store strings: either as Java strings, or by
> copying characters onto a character heap.  The second option is only used for deserialization
> from a binary CAS.  However, even when it is not used, this capability carries a significant
> memory overhead.  To demonstrate this, I ran the following experiment.  As the analysis engine,
> I used our sandbox POS tagger, which sets just one string feature on each token.  As text, I
> used a 2.4MB input file (2x moby.txt).  To run this in IBM Java 1.5.0_7 (which happens to be
> the JVM I'm interested in), you need to specify -Xmx135M; I checked in 5MB increments.  Then I
> patched the StringHeap implementation to work without the additional bookkeeping overhead and
> ran the experiment again.  I was then able to run with -Xmx115M.  That is 20MB (about 15%) less
> maximum heap, a significant gain, particularly given that I ran so little analysis (only
> tokens and sentences are produced, and only a single string-valued feature is set).  The new
> code also ran a tiny bit faster, but not by much.  The speedup might be larger for analysis
> that is less compute-intensive than the tagger.
> The challenge is to make sure that the serialization code still works after this change.
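
For readers unfamiliar with the design being discussed, here is a minimal sketch of a string heap that keeps only Java strings and converts character data to a String on the way in, so that no separate char heap or reference heap has to be maintained.  It is an illustration under stated assumptions, not the actual UIMA StringHeap: the class name SimpleStringHeap, its methods (add, addFromChars, get, size), and the convention of reserving address 0 for the null string are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, simplified string heap. This is NOT the UIMA StringHeap;
 * all names and the address-0 convention are assumptions for illustration.
 */
final class SimpleStringHeap {

    // Single backing store: one Java String per address.
    // Index 0 is reserved so that address 0 can represent "no string".
    private final List<String> strings = new ArrayList<>();

    SimpleStringHeap() {
        strings.add(null);
    }

    /** Stores a Java string directly and returns its address. */
    int add(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    /**
     * Path a deserializer might use: the characters arrive in a buffer,
     * and are turned into a String immediately instead of being copied
     * onto a separate character heap with its own bookkeeping.
     */
    int addFromChars(char[] buffer, int start, int length) {
        return add(new String(buffer, start, length));
    }

    /** Returns the string stored at the given address (null for address 0). */
    String get(int address) {
        return strings.get(address);
    }

    /** Number of strings stored, not counting the reserved slot. */
    int size() {
        return strings.size() - 1;
    }
}
```

With a layout like this, the per-string cost is one list slot plus the String itself.  Whether the real implementation uses the same structure is not stated in this message.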

