uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: small memory footprint tradeoff configuration
Date Wed, 11 Mar 2009 11:25:16 GMT
Thanks for your comments.

Thilo Goetz wrote:
> Marshall Schor wrote:
>> After reviewing the previous chain of discussion on this topic, I would
>> like to start the next round, hopefully getting to some convergence :-).
>> 1) On the topic of doing GC (garbage collection) versus copy to another
>> CAS - GC is conceptually perhaps less complex - you don't have multiple
>> CASes around.  I note that Adam (and maybe others) preferred this.
>> 2) Keeping IDs: within the context of an aggregate, doing a GC and
>> having the resulting FeatureStructure IDs change is an issue.  Some
>> suggestions to lessen this:
>>   a) No automatic GC, just do when requested.
>>   b) If done in an Aggregate (could be an inner one), guarantee that IDs
>> that existed upon entry to the aggregate would be preserved.  This could
>> be done using the high-water-mark mechanism that was put in for
>> delta-cas.  This seems to have some nice properties concerning what a
>> user of an aggregate has to know about the aggregate's inner behavior -
>> in particular, the user would not need to know about that aggregate's
>> call to GC, since the outer aggregate's "handles" to feature structures
>> would not move due to the GC.
> I'm a little foggy on the concept of explicit GC calls.  Where would
> the API reside, and who would be allowed to call it, and when?
One possibility is to not have it be callable, but to have an annotator
or aggregate marked so that the framework would run the GC after a
CAS exited that component.  I haven't figured out other details here,
though.  Details might include (a) an overall enabling of this, (b) some
indication that there's enough to reclaim in the CAS to make it
worthwhile.  Worthwhile may be hard to define, though, so some kind of
explicit indication to do it might be needed.  The indication might be
to set some context flag saying to do the GC when you exit the
annotator.  Here's a use case: testing shows your application is taking
31 MB of storage, but the app in which you're embedded has a hard limit
of 30 MB... 
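
To make the high-water-mark idea in 2b more concrete, here is a toy sketch.
It is not UIMA code: ToyHeap, highWaterMark() and compactAbove() are names
made up for this illustration, and "feature structures" are just strings
addressed by their list index.  It only shows that a GC which leaves
everything below the mark untouched cannot change IDs the outer aggregate
already holds.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only; none of these names are UIMA APIs.
// A feature structure's ID is simply its index in the list.
class ToyHeap {
    private final List<String> cells = new ArrayList<>();

    int add(String fs) { cells.add(fs); return cells.size() - 1; }
    String get(int id) { return cells.get(id); }
    int size() { return cells.size(); }

    // Taken on entry to the aggregate: every ID below this value is "old".
    int highWaterMark() { return cells.size(); }

    // Compacts only cells at or above the mark. Cells below the mark are
    // neither moved nor reclaimed, so their IDs stay valid.
    // Returns old ID -> new ID; -1 means the cell was reclaimed.
    int[] compactAbove(int mark, BitSet live) {
        int[] remap = new int[cells.size()];
        int next = mark;
        for (int id = 0; id < cells.size(); id++) {
            if (id < mark) {
                remap[id] = id;                 // preserved as-is
            } else if (live.get(id)) {
                cells.set(next, cells.get(id)); // slide down; next <= id
                remap[id] = next++;
            } else {
                remap[id] = -1;                 // garbage above the mark
            }
        }
        cells.subList(next, cells.size()).clear(); // drop the freed tail
        return remap;
    }

    public static void main(String[] args) {
        ToyHeap heap = new ToyHeap();
        int outer = heap.add("created before the aggregate");
        int mark = heap.highWaterMark();
        heap.add("temporary, no longer referenced");
        int kept = heap.add("created inside, still referenced");

        BitSet live = new BitSet();
        live.set(kept);
        int[] remap = heap.compactAbove(mark, live);

        System.out.println("outer ID unchanged: " + (remap[outer] == outer));
        System.out.println("kept moved from " + kept + " to " + remap[kept]);
    }
}
```

Note that in this sketch garbage below the mark is never reclaimed; that is
the price of keeping the outer IDs stable, and is why a separate "move
everything" option might still be wanted.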
>> Part of this concept could be to allow an option to have the GC to move
>> everything; maybe a different explicit call.
>> 3) There is a potential to trade performance for space using String
>> "interning" - to ensure string values set for features are stored just
>> once.  This is typically done using a hashmap of some kind, so there's
>> overhead for that, so it may or may not actually reduce space.  There is
>> also a potential to store Strings using UTF-8 encoding - which may or
>> may not save space (depends on the string, etc.)
> Any Java programmer who does a lot of string handling should know
> when to intern strings, when to use constants, and when to create
> new Strings.  
Well, my experience is quite different.  Many Java programmers around
here were unfamiliar with interning.  But I do basically agree that some
(or most) of this benefit can happen via annotator writers.  Perhaps we
need to document this in some new section (e.g. on how to write small
footprint annotators).
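
For such a section, something like the following could serve as a starting
example.  It is only a sketch: StringPool and share() are hypothetical names,
not an existing UIMA facility.  It shows the hashmap-based approach mentioned
in point 3 (the map itself is the overhead that may or may not pay off), plus
a rough way to compare UTF-8 and UTF-16 payload sizes.  The size figures
ignore object headers and assume two bytes per char for the Java String.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a UIMA class: coalesces equal strings
// into one shared instance.
class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled instance equal to s, adding s if it is new.
    String share(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() { return pool.size(); }

    // Payload size if the string were stored as UTF-8 bytes.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Payload size as UTF-16 code units (assumption: 2 bytes per char).
    static int utf16Bytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        String a = new String("noun");   // two distinct objects,
        String b = new String("noun");   // equal contents
        System.out.println("distinct before pooling: " + (a != b));
        System.out.println("shared after pooling: " + (pool.share(a) == pool.share(b)));

        String ascii = "token";
        String cjk = "\u65e5\u672c\u8a9e"; // three CJK characters
        System.out.println(ascii + ": utf8=" + utf8Bytes(ascii) + " utf16=" + utf16Bytes(ascii));
        System.out.println("cjk: utf8=" + utf8Bytes(cjk) + " utf16=" + utf16Bytes(cjk));
    }
}
```

The built-in String.intern() gives the same sharing without a user-level
map, but the pooled strings then live in the JVM-wide intern table rather
than in something an application can discard.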

> Who do you think you're going to be helping with this?

> What's the use case?
>> For String interning, we could have two different kinds of approaches to
>> specifying its use: a global, application-wide setting or a specific
>> setting (e.g., add a new basic type to UIMA, called, for instance,
>> cas.uima.SharedString). Using a new type would allow users to pick just
>> the cases where they wanted the extra machinery to coalesce equal
>> strings to one shared object.
>> 4) There is a potentially big space reduction possible by being able to
>> mark some fields of feature structures as never being "read".  Such
>> fields could then be not stored.  For instance, a feature structure of
>> type TOKEN might have many fields, representing various information -
>> only some of which might be used in a particular application.  Even if
>> the ResultSpecification for a tokenizer is set to indicate not to
>> "produce" these fields, today, space is consumed for those fields (they
>> are filled with "null" or 0).  If there are many instances of this
>> feature structure type (such as Token), this can be a significant space
>> saver.  To identify a field as never being read, one could look at the
>> aggregate's component's capability specification - and mark the field if
>> it doesn't appear in any of the delegate's input specs.  For an
>> outermost aggregate, one would probably want to add that aggregate's
>> output specification - to capture the outer application's potential use
>> of fields.
> We need to be careful here not to destroy backward compatibility.
> Result specs are optional in fixed flows, and many annotators (that
> I use) don't use them.
> What we usually do in cases like this is that we modify the type
> system.  The annotator (in this case, the tokenizer) checks the
> type system on startup.  Presence/absence of features triggers/
> inhibits certain processing.  This may not be an ideal solution,
> but it works because it requires the cooperation of the annotator
> writer.
> If you don't have access to the annotator's source code, you'll
> never know if it can really work without those features in all
> cases.  If you do have the source code, you can make it work
> with the scheme above.
I agree that backward compatibility is important and is an issue.  To
help the transition to this new scheme, I think an overall global switch
is needed (similar to the switches we have for JCas "interning") that
would by default make things work the way they do now.  A user
interested in small-footprint operation (and in trading off some
additional processing cycles to achieve it) would enable this switch.

To help it "work", we would allow operations that "set" a non-stored
feature to continue - the set would just become a no-op.  Then if
the annotator wasn't paying attention to the ResultSpecification, and tried
to set features that were not used, it would still work.

On the other end, if an annotator actually made use of a particular
feature, but didn't specify it in its "input capability specification",
that would fail with this scheme.  The failure would be some kind of
Java exception, which would probably be noticed.  To recover, a user of
such a component would modify the input capability specification to
indicate that that feature was needed. 
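
Here is a small sketch of the behavior described in the last two
paragraphs.  It is an illustration only: SparseFeatureStore and
storedFeatures() are invented names, features are plain strings, and the
real CAS layout is nothing like a HashMap.  The set of stored features is
the union of the delegates' declared inputs plus the outer aggregate's
declared outputs; a set on anything else is silently dropped, and a get on
anything else throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only; not a UIMA class.
class SparseFeatureStore {
    private final Set<String> stored;
    private final Map<String, Object> values = new HashMap<>();

    SparseFeatureStore(Set<String> stored) {
        this.stored = stored;
    }

    // A feature is kept if some delegate declares it as input, or the
    // outermost aggregate declares it as output (the application may read it).
    static Set<String> storedFeatures(List<Set<String>> delegateInputs,
                                      Set<String> aggregateOutputs) {
        Set<String> read = new HashSet<>(aggregateOutputs);
        for (Set<String> inputs : delegateInputs) {
            read.addAll(inputs);
        }
        return read;
    }

    // Setting a non-stored feature is a no-op, so annotators that ignore
    // the ResultSpecification keep working.
    void set(String feature, Object value) {
        if (stored.contains(feature)) {
            values.put(feature, value);
        }
    }

    // Reading a non-stored feature fails loudly; the fix is to declare it.
    Object get(String feature) {
        if (!stored.contains(feature)) {
            throw new IllegalStateException(
                "Feature not stored; add it to an input capability: " + feature);
        }
        return values.get(feature);
    }

    public static void main(String[] args) {
        List<Set<String>> delegateInputs = new ArrayList<>();
        delegateInputs.add(new HashSet<>(Arrays.asList("Token:pos")));
        Set<String> outputs = new HashSet<>(Arrays.asList("Token:lemma"));

        SparseFeatureStore token =
            new SparseFeatureStore(storedFeatures(delegateInputs, outputs));
        token.set("Token:pos", "NN");
        token.set("Token:stem", "run");   // dropped, nobody declared it
        System.out.println(token.get("Token:pos"));
        try {
            token.get("Token:stem");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```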

As I write this, I notice that the input capability specification for a
primitive annotator doesn't quite fit the meaning here - because I think
it means that this annotator needs that feature upon input - and this
edge case - where the annotator itself produces this feature, and then
also uses it - is not part of that definition. We could either expand
the meaning here to include this edge case, or (possibly a better
option) introduce, explicitly, another piece of metadata indicating that
a particular type/field was both created and used by this one primitive
annotator.  A third option could be to store these "unused" features if
set (in some out-of-line temporary storage) for the duration of the
running of a particular annotator, just in case these were "used" by the
same annotator, and then discard that extra storage after the annotator
exits.  This would be a big (but temporary) storage hit, though, so I
don't think I would want to do this.

>> There is another layer of space granularity we could consider. This
>> would be to let the assembler divide the components into "groups"
>> (perhaps just using the natural grouping aggregates provide), and
>> compute (or specify) which fields of types were "not used" by group. For
>> instance, consider group1 which has lots of extra fields in Token, which
>> are used during group1 components, followed by group2 processing, which
>> doesn't use most of the fields in Token.  A GC operation could then
>> "compress" the representation of Token (of those where preserving the
>> FeatureStructure ID wasn't required). 
>> To avoid creation of new collections of components, we could restrict
>> the "groups" to be just what a particular aggregate contained.  Using
>> this approach, we could envision using the input/output capability
>> specifications to automatically deduce which extra fields could be
>> eliminated.  We could also have an automatic GC mode - which would
>> invoke the GC (that didn't alter pre-existing feature structures) at the
>> end of all aggregates.  Although this might do too many GCs, it would
>> be conceptually simple.
>> -Marshall
