uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: small memory footprint tradeoff configuration
Date Tue, 10 Mar 2009 22:49:41 GMT
After reviewing the previous chain of discussion on this topic, I would
like to start the next round, hopefully getting to some convergence :-).

1) On the topic of doing GC (garbage collection) versus copy to another
CAS - GC is conceptually perhaps less complex - you don't have multiple
CASes around.  I note that Adam (and maybe others) preferred this.

2) Keeping IDs: within the context of an aggregate, doing a GC and
having the resulting FeatureStructure IDs change is an issue.  Some
suggestions to lessen this:
  a) No automatic GC, just do it when requested.
  b) If done in an Aggregate (could be an inner one), guarantee that IDs
that existed upon entry to the aggregate would be preserved.  This could
be done using the high-water-mark mechanism that was put in for
delta-cas.  This seems to have some nice properties concerning what a
user of an aggregate has to know about the aggregate's inner behavior -
in particular, the user would not need to know about that aggregate's
call to GC, since the outer aggregate's "handles" to feature structures
would not move due to the GC.

Part of this concept could be an option to have the GC move
everything; maybe a different explicit call.
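
To make 2b concrete, here is a minimal sketch of compaction above a high-water mark.  It is not UIMA code: the heap is modeled as a plain list whose indices stand in for FeatureStructure IDs, and the class name, method name and "live" set are all hypothetical.  Slots below the mark are never moved (even if unreachable), so IDs handed out before entry to the aggregate stay valid.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: list index == FeatureStructure ID.
final class HighWaterMarkGc {

    /**
     * Compacts only the slots at or above 'mark', keeping those whose ID is
     * in 'live'. Slots below 'mark' are left untouched. Returns a map from
     * old ID to new ID for every surviving slot at or above the mark.
     */
    static Map<Integer, Integer> compactAbove(List<String> heap, int mark, Set<Integer> live) {
        Map<Integer, Integer> remap = new HashMap<>();
        int next = mark;                       // next free slot above the mark
        for (int id = mark; id < heap.size(); id++) {
            if (live.contains(id)) {
                heap.set(next, heap.get(id));  // slide the survivor down
                remap.put(id, next);
                next++;
            }
        }
        heap.subList(next, heap.size()).clear(); // drop the reclaimed tail
        return remap;
    }
}
```

The mark would be recorded as the heap size on entry to the aggregate.  A real implementation would also have to use the returned map to rewrite references held in indexes and in other feature structures; that step is omitted here.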

3) There is a potential to trade performance for space using String
"interning" - to ensure string values set for features are stored just
once.  This is typically done using a hashmap of some kind, which has its
own overhead, so it may or may not actually reduce space.  There is
also a potential to store Strings using UTF-8 encoding - which may or
may not save space (depends on the string, etc.)
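
As a rough illustration of why UTF-8 may or may not help, the sketch below compares payload sizes against 2 bytes per char, which is an assumption about how the strings are held today (object and array headers are ignored).  The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing encoded sizes; not part of UIMA.
final class EncodingSize {

    /** Number of bytes needed to hold s as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes needed to hold s as 2-byte chars (UTF-16 code units). */
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }
}
```

For ASCII text such as "Token", UTF-8 takes 5 bytes versus 10.  For a three-character CJK string such as "\u65e5\u672c\u8a9e", UTF-8 takes 9 bytes versus 6, so the saving depends on the text.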

For String interning, we could have two different kinds of approaches to
specifying its use: a global, application-wide setting or a specific
setting (e.g., add a new basic type to UIMA, called, for instance,
cas.uima.SharedString). Using a new type would allow users to pick just
the cases where they wanted the extra machinery to coalesce equal
strings to one shared object.
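
The coalescing machinery could look something like the following sketch, assuming a simple HashMap-backed pool.  The class name StringPool and its methods are hypothetical; this is not how UIMA stores strings, only an illustration of the idea and of where the extra overhead comes from (one map entry per distinct string).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool that maps each distinct string value to one shared instance.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the shared instance equal to s, registering s if it is new. */
    String share(String s) {
        if (s == null) {
            return null;                        // null feature values pass through
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held. */
    int size() {
        return pool.size();
    }
}
```

A setter for a feature of the shared-string kind would call share() on the value before storing it; ordinary String features would skip the lookup and its cost.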

4) There is a potentially big space reduction possible by being able to
mark some fields of feature structures as never being "read".  Such
fields could then be not stored.  For instance, a feature structure of
type TOKEN might have many fields, representing various information -
only some of which might be used in a particular application.  Even if
the ResultSpecification for a tokenizer is set to indicate not to
"produce" these fields, today, space is consumed for those fields (they
are filled with "null" or 0).  If there are many instances of this
feature structure type (such as Token), not storing these fields can be a
significant space saver.  To identify a field as never being read, one
could look at the capability specifications of the aggregate's
components - and mark the field if it doesn't appear in any of the
delegates' input specs.  For an
outermost aggregate, one would probably want to add that aggregate's
output specification - to capture the outer application's potential use
of fields.
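
The computation described above amounts to a set difference.  Here is a minimal sketch, with feature names as plain strings rather than real capability objects; the class, method and parameter names are assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: which declared features does nobody read?
final class UnreadFeatures {

    /**
     * Returns the declared features that appear in no delegate's input spec
     * and not in the outermost aggregate's output spec.
     */
    static Set<String> unread(Set<String> declared,
                              List<Set<String>> delegateInputs,
                              Set<String> outerOutputs) {
        Set<String> result = new TreeSet<>(declared);
        for (Set<String> inputs : delegateInputs) {
            result.removeAll(inputs);     // read by some delegate
        }
        result.removeAll(outerOutputs);   // possibly read by the application
        return result;
    }
}
```

For an inner aggregate the outerOutputs set would simply be empty.  Features left in the result are candidates for not being stored at all.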

There is another layer of space granularity we could consider. This
would be to let the assembler divide the components into "groups"
(perhaps just using the natural grouping aggregates provide), and
compute (or specify) which fields of types were "not used" by group. For
instance, consider group1 which has lots of extra fields in Token, which
are used during group1 components, followed by group2 processing, which
doesn't use most of the fields in Token.  A GC operation could then
"compress" the representation of Token (for those instances where
preserving the FeatureStructure ID wasn't required).

To avoid creation of new collections of components, we could restrict
the "groups" to be just what a particular aggregate contained.  Using
this approach, we could envision using the input/output capability
specifications to automatically deduce which extra fields could be
eliminated.  We could also have an automatic GC mode - which would
invoke the GC (that didn't alter pre-existing feature structures) at the
end of all aggregates.  Although this might do too many GCs, it would
be conceptually simple.

