uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: small memory footprint tradeoff configuration
Date Wed, 11 Mar 2009 09:11:45 GMT
Marshall Schor wrote:
> After reviewing the previous chain of discussion on this topic, I would
> like to start the next round, hopefully getting to some convergence :-).
> 1) On the topic of doing GC (garbage collection) versus copy to another
> CAS - GC is conceptually perhaps less complex - you don't have multiple
> CASes around.  I note that Adam (and maybe others) preferred this.
> 2) Keeping IDs: within the context of an aggregate, doing a GC and
> having the resulting FeatureStructure IDs change is an issue.  Some
> suggestions to lessen this:
>   a) No automatic GC, just do it when requested.
>   b) If done in an Aggregate (could be an inner one), guarantee that IDs
> that existed upon entry to the aggregate would be preserved.  This could
> be done using the high-water-mark mechanism that was put in for
> delta-cas.  This seems to have some nice properties concerning what a
> user of an aggregate has to know about the aggregate's inner behavior -
> in particular, the user would not need to know about that aggregate's
> call to GC, since the outer aggregate's "handles" to feature structures
> would not move due to the GC.

I'm a little foggy on the concept of explicit GC calls.  Where would
the API reside, and who would be allowed to call it, and when?
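
To make the quoted high-water-mark idea concrete, here is a minimal sketch
of which IDs a GC could move. It is only an illustration: the class and
method names are hypothetical, feature structures are modeled as uniform
slots whose index is their ID (the real CAS heap is not laid out this way),
and it says nothing about where such an API would live.

```java
// Hypothetical sketch, not UIMA code: IDs below the mark are never moved.
final class HighWaterMarkGc {
    // live[id] says whether slot id is still reachable.
    // Returns newId[id]: the ID after compaction, or -1 if the slot is collected.
    // Assumes 0 <= mark <= live.length.
    static int[] remap(boolean[] live, int mark) {
        int[] newId = new int[live.length];
        int next = mark; // first slot that compaction may reuse
        for (int id = 0; id < live.length; id++) {
            if (id < mark) {
                newId[id] = id;      // existed on entry to the aggregate: preserved
            } else if (live[id]) {
                newId[id] = next++;  // created inside the aggregate: packed downward
            } else {
                newId[id] = -1;      // garbage above the mark: dropped
            }
        }
        return newId;
    }
}
```

In this model the outer aggregate's handles stay valid because nothing below
the mark is renumbered; only code inside the aggregate has to cope with
changed IDs.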

> Part of this concept could be to allow an option to have the GC to move
> everything; maybe a different explicit call.
> 3) There is a potential to trade performance for space using String
> "interning" - to ensure string values set for features are stored just
> once.  This is typically done using a hashmap of some kind, so there's
> overhead for that, so it may or may not actually reduce space.  There is
> also a potential to store Strings using UTF-8 encoding - which may or
> may not save space (depends on the string, etc.)

Any Java programmer who does a lot of string handling should know
when to intern strings, when to use constants, and when to create
new Strings.  Who do you think you're going to be helping with this?
What's the use case?
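
For reference, the hashmap-based interning mentioned in the quoted text
amounts to something like the sketch below. StringPool is a hypothetical
helper, not an existing UIMA class, and the map itself costs memory, which
is why the saving is not guaranteed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of UIMA: coalesces equal strings to one shared instance.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance for s; the first instance seen becomes canonical.
    String share(String s) {
        if (s == null) {
            return null;
        }
        String previous = pool.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }

    // Number of distinct strings currently held by the pool.
    int size() {
        return pool.size();
    }
}
```

A feature setter could pass its value through share() before storing it, so
that equal values end up referencing a single String object.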

> For String interning, we could have two different kinds of approaches to
> specifying its use: a global, application-wide setting or a specific
> setting (e.g., add a new basic type to UIMA, called, for instance,
> cas.uima.SharedString). Using a new type would allow users to pick just
> the cases where they wanted the extra machinery to coalesce equal
> strings to one shared object.
> 4) There is a potentially big space reduction possible by being able to
> mark some fields of feature structures as never being "read".  Such
> fields could then be not stored.  For instance, a feature structure of
> type TOKEN might have many fields, representing various information -
> only some of which might be used in a particular application.  Even if
> the ResultSpecification for a tokenizer is set to indicate not to
> "produce" these fields, today, space is consumed for those fields (they
> are filled with "null" or 0).  If there are many instances of this
> feature structure type (such as Token), this can be a significant space
> saver.  To identify a field as never being read, one could look at the
> aggregate's components' capability specifications - and mark the field if
> it doesn't appear in any of the delegates' input specs.  For an
> outermost aggregate, one would probably want to add that aggregate's
> output specification - to capture the outer application's potential use
> of fields.

We need to be careful here not to destroy backward compatibility.
Result specs are optional in fixed flows, and many annotators (that
I use) don't use them.
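
For concreteness, the deduction proposed in the quoted text boils down to a
set difference, sketched below. The class name and the "Type:feature" string
encoding are assumptions made for the example, and the result is only
trustworthy if every delegate really declares all the features it reads.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: features are plain strings such as "Token:lemma".
final class UnreadFeatures {
    // Declared features minus everything a delegate lists as input
    // or the aggregate lists as output.
    static Set<String> compute(Set<String> declared,
                               List<Set<String>> delegateInputs,
                               Set<String> aggregateOutputs) {
        Set<String> unread = new HashSet<>(declared);
        for (Set<String> inputs : delegateInputs) {
            unread.removeAll(inputs);
        }
        unread.removeAll(aggregateOutputs);
        return unread;
    }
}
```

An annotator that reads a feature without declaring it would make that
feature show up in the result even though it is in use.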

What we usually do in cases like this is that we modify the type
system.  The annotator (in this case, the tokenizer) checks the
type system on startup.  Presence/absence of features triggers/
inhibits certain processing.  This may not be an ideal solution,
but it works because it requires the cooperation of the annotator.

If you don't have access to the annotator's source code, you'll
never know if it can really work without those features in all
cases.  If you do have the source code, you can make it work
with the scheme above.
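
A minimal sketch of that scheme, assuming a tokenizer with two optional
features. To keep it self-contained, the type system is modeled as a plain
map from type name to feature names; a real annotator would query the UIMA
type system instead, and the type and feature names here are made up.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: the type system is modeled as type name -> feature base names.
final class TokenizerConfig {
    final boolean producePartOfSpeech;
    final boolean produceLemma;

    // Called once at startup: a feature missing from the type system
    // switches off the processing that would fill it.
    TokenizerConfig(Map<String, Set<String>> typeSystem) {
        Set<String> tokenFeatures = typeSystem.getOrDefault("example.Token", Set.of());
        producePartOfSpeech = tokenFeatures.contains("partOfSpeech");
        produceLemma = tokenFeatures.contains("lemma");
    }
}
```

Removing a feature from the type system then saves both the space for it
and the work of computing it, but only for annotators written this way.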

> There is another layer of space granularity we could consider. This
> would be to let the assembler divide the components into "groups"
> (perhaps just using the natural grouping aggregates provide), and
> compute (or specify) which fields of types were "not used" by group. For
> instance, consider group1 which has lots of extra fields in Token, which
> are used during group1 components, followed by group2 processing, which
> doesn't use most of the fields in Token.  A GC operation could then
> "compress" the representation of Token (of those where preserving the
> FeatureStructure ID wasn't required). 
> To avoid creation of new collections of components, we could restrict
> the "groups" to be just what a particular aggregate contained.  Using
> this approach, we could envision using the input/output capability
> specifications to automatically deduce which extra fields could be
> eliminated.  We could also have an automatic GC mode - which would
> invoke the GC (that didn't alter pre-existing feature structures) at the
> end of all aggregates.  Although this might do too many gc's, it would
> be conceptually simple.
> -Marshall
