uima-dev mailing list archives

From "D.J. McCloskey" <dj_mcclos...@ie.ibm.com>
Subject Re: small memory footprint tradeoff configuration
Date Thu, 12 Mar 2009 21:25:24 GMT
I think this would be a good addition. I don't see a need for an explicit
call to invoke GC. A threshold parameter relative to the CAS heap size
would be useful, as would the ability to request a GC on exiting one
analysis engine and before invoking the next. This could be turned off for
applications where folks don't want the overhead it causes in complex
aggregates. Ideally, a specific value of the engine boundary parameter
would also allow combining the threshold with the AE boundary auto-collect:
the collect would run at the boundary only if the heap/threshold ratio
had been exceeded. This would give more than enough control and address
the need.
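
To make the combination concrete, here is a minimal sketch of the boundary
decision. All names in it (BoundaryMode, CasGcPolicy, the cell counts) are
hypothetical and are not part of UIMA or its PerformanceTuningSettings; it
only illustrates the three modes described above.

```java
// Hypothetical sketch only: none of these names exist in UIMA.
enum BoundaryMode {
    OFF,                    // never collect at an AE boundary
    ALWAYS,                 // collect at every AE boundary
    IF_THRESHOLD_EXCEEDED   // collect at a boundary only when the heap ratio is exceeded
}

final class CasGcPolicy {
    private final BoundaryMode mode;
    private final double thresholdRatio; // fraction of CAS heap in use, 0.0 to 1.0

    CasGcPolicy(BoundaryMode mode, double thresholdRatio) {
        if (thresholdRatio < 0.0 || thresholdRatio > 1.0) {
            throw new IllegalArgumentException("thresholdRatio must be in [0, 1]");
        }
        this.mode = mode;
        this.thresholdRatio = thresholdRatio;
    }

    // Called by the (assumed) framework when one analysis engine returns
    // and before the next one is invoked.
    boolean shouldCollectAtBoundary(long usedCells, long capacityCells) {
        switch (mode) {
            case OFF:
                return false;
            case ALWAYS:
                return true;
            case IF_THRESHOLD_EXCEEDED:
                return capacityCells > 0
                        && (double) usedCells / capacityCells > thresholdRatio;
            default:
                throw new AssertionError(mode);
        }
    }
}
```

With OFF as the default, existing applications would see no change in
behaviour.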

I don't know if it would have to be limited to a capability flow scenario,
but it would be nice to also have the option to use the output capabilities
to define what to keep. Or was this what you were thinking?

I'd also like to raise the notion of type-driven collection. Often in
analysis there are types which purely support the identification of some
primary type. The ability to declare some types as transient would perhaps
aid the overall objective and mitigate the aggressive GC cases. On the
aggressive GC point, I believe anything which invalidates low-level
handles is not really acceptable.

-DJ
-------------------
D.J McCloskey
IBM LanguageWare

IBM Ireland Product Distribution Limited registered in Ireland with number
92815.  Registered office: Oldbrook House, 24-32 Pembroke Road,
Ballsbridge, Dublin 4



                                                                                         
                                                
From:    Marshall Schor <msa@schor.com>
To:      uima-dev@incubator.apache.org
Date:    12/03/2009 18:06
Subject: Re: small memory footprint tradeoff configuration





I agree with both of these concepts:  only GC'ing things which are not
in the index and also not reachable from something that is in the index,
and making GC'ing (mostly) automatic, based on thresholds, etc, when a
component exits back to the framework.  This would be fine for now - if
use cases come up where some more programmatic control of this is
needed, we could add something.

Maybe the next thing to focus on is the "contract" re: GC running.  For
a component (primitive or aggregate), the proposed contract is to have
the GC not change the FS "id"s that existed prior to the component
running.  This is a tradeoff - for more stability with existing handle
uses, versus less "aggressive" GC's.
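
As an illustration of this contract, the following sketch models FS ids as
plain integers and the heap as a reference map. It is a toy model under
stated assumptions, not UIMA's CAS implementation: the class name, the
"firstNewId" watermark and the map-based heap are all hypothetical. Only FSs
created during the current process call (id >= firstNewId) can be reported
as garbage, so ids that existed before the component ran are never touched.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of the proposed contract; not UIMA code.
final class BoundaryGcSketch {

    // firstNewId: first id allocated during the current process() call.
    // nextFreeId: one past the last id allocated so far.
    // indexed:    ids of FSs currently in the indexes (the GC roots).
    // refs:       for each FS id, the ids it references through its features.
    static Set<Integer> findGarbage(int firstNewId, int nextFreeId,
                                    Set<Integer> indexed,
                                    Map<Integer, List<Integer>> refs) {
        // Mark everything reachable, directly or indirectly, from the indexes.
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>(indexed);
        while (!work.isEmpty()) {
            int id = work.pop();
            if (reachable.add(id)) {
                for (int target : refs.getOrDefault(id, Collections.emptyList())) {
                    work.push(target);
                }
            }
        }
        // Only FSs created since the watermark are candidates for collection.
        Set<Integer> garbage = new TreeSet<>();
        for (int id = firstNewId; id < nextFreeId; id++) {
            if (!reachable.contains(id)) {
                garbage.add(id);
            }
        }
        return garbage;
    }
}
```

An FS that predates the watermark is left alone even if it is no longer
indexed or referenced, which is the "less aggressive" side of the tradeoff.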

-Marshall

Thilo Goetz wrote:
> Adam Lally wrote:
>
>> On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <msa@schor.com> wrote:
>>
>>> I agree in general about not making things more complicated at least to
>>> the user.  I can imagine education working for
>>>  1) things like string interning
>>>  2) things like deleting features from type systems where they're not
>>> being used, and where the annotator producing them will respect this.
>>>
>>> What this approach seems to miss are the following kinds of things:
>>>
>>> 1) cases where some set of annotators produce feature structures, which,
>>> after some point, are no longer needed, and are "deleted" but
>>> never-the-less continue to consume space.
>>>
>>> 2) cases where some set of annotators produce feature structures having
>>> lots of fields, where, after some point, the fields are no longer needed.
>>>
>>> If these are not significant use-cases in practice, then I'm happy to
>>> think-about / work-on other things :-).
>>>
>>>
>> I'd like to propose discussing the different ideas here one at a time.
>>  We had enough trouble coming to any agreement on GC the last time
>> that we discussed it, without also throwing string interning and
>> feature deleting into the mix.
>>
>> So focusing on GC first (unless you think one of the others is more
>> important):
>>
>> My inclination is to assure that GC deletes only garbage, and that
>> there's no possibility that anything GC'ed could have been referenced
>> by anybody.  The other proposals that don't have this guarantee are
>> scary to me.
>>
>> A way to accomplish this guarantee would be that when the process
>> method of an AnalysisEngine (could be either primitive or aggregate)
>> completes, we can mark as garbage any FS's that were created since the
>> beginning of that process method, but which are not referenced
>> directly or indirectly from anything in the indexes.  Does this
>> concept seem reasonable?
>>
>
> +1. I like the idea because it is sort of local on the one
> hand, but still allows one to delete FSs from indexes
> later in the processing and have them garbage collected
> (on exiting the containing aggregate).
>
>
>> The next question is under what conditions would a GC execute.
>> Requiring an explicit call seems counter to what other garbage
>> collecting runtime environments do, and like Thilo I'm confused about
>> who would call this and when.  I think it would be better to define
>> the parameters that control GC in the PerformanceTuningSettings that
>> we already have, and make them dependent on how much CAS heap space is
>> used relative to a GC threshold that the user has set in the
>> PerformanceTuningSettings.
>>
>
> +1, and the default could be "no GC", so it would be
> perfectly backwards compatible.  I'm thinking of the
> kinds of scenarios that I often work with, where
> basically all the annotations are later written to
> an index, and any attempt at GC would be futile and
> just consume time to no benefit.
>
>
>>  -Adam
>>
>
>
>
>


