uima-dev mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Performance and ALLOW_DUP_ADD_TO_INDEXES
Date Thu, 07 Jan 2016 07:38:23 GMT

Rule engineering in Ruta follows a paradigm where you create large
numbers of annotations, many with the same type and offsets.
Additionally, and probably the most critical case, the explanation of
the rule inference creates such annotations for debugging. Imagine
something like an annotation of the type "Debug" for every text span
where some rule tried to match (even if it did not succeed).

In one of my use cases, the time spent in deserialization in the CAS
Editor dropped from 70% (XMI) to 2% (binary).

Here's the discussion:



On 07.01.2016 at 02:38, Marshall Schor wrote:
> Thanks for the information!
> It does look like there's a performance issue if not allowing duplicate adds, for
> exactly the use case you mentioned: lots of FSs which compare "equal" according
> to the sorted index keys, but which are not the same FS.
> This can be fixed I think.
> There's also a user-centered workaround, for some cases, when it's possible to
> re-define the type system somewhat.
> One kind of thing I've seen frequently, is that people define types having
> nothing to do with Annotation (e.g. they don't use begin / end, etc.) as
> subtypes of Annotation.
> If you are able to change the type system definitions so that these things no
> longer are subtypes of Annotation, then the problem might go away.
> -Marshall
> On 1/6/2016 6:18 PM, Richard Eckart de Castilho wrote:
>>>> I am starting to get suspicious of global flags for backwards compatibility.
>>>> E.g. since ALLOW_DUP_ADD_TO_INDEXES was introduced, we have people complaining
>>>> about a performance drop. ALLOW_DUP_ADD_TO_INDEXES can only be enabled/disabled
>>>> globally, but not specifically for individual indexes. Neither can it be
>>>> temporarily disabled, e.g. during deserialization or other bulk operations.
>>>> I wonder if local getters/setters or ThreadLocal variables initialized by
>>>> a global setting wouldn't be a more appropriate option.
>>> I was unaware of the performance issue; I may have missed some emails...  Can
>>> you say how significant it is?  If there were no performance issue, would the
>>> additional function be needed?
>>> I assume the performance drop is when duplicates are not allowed (the new
>>> default), and some users are wanting to restore the previous performance by
>>> turning on ALLOW_DUP ....  Is this correct?
>> I didn't track it in detail, but apparently, some time back Peter noticed a
>> drop in XMI deserialization performance and more recently also in compressed
>> binary CAS deserialization. Some time later, I had a person claiming in
>> private mail that deserialization was O(n^2) with respect to the CAS size.
>> At that point, I had a look at the code and it appears that in the worst
>> case, the duplication check degrades to a linear CAS scan
>> (cf. FSIndexRepositoryImpl line 98ff and FSIntArrayIndex line 101ff).
>> That would happen if the CAS contains only items that are equal with
>> respect to the index criteria, but not actually equal.
>> Consider a hypothetical annotation type:
>> Metadata extends Annotation {
>>   String key;
>>   String value;
>> }
>> where the begin/end are always set to 0..documentLength() and
>> key/value have arbitrary values. I didn't try it, but if I 
>> understood the code correctly, a CAS containing only such
>> annotations would suffer heavily during the addToIndexes().
>> Cheers,
>> -- Richard
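
As a rough illustration of the worst case described in the quoted message, here is a toy sketch in plain Java. It does not use any UIMA classes and is not the code in FSIndexRepositoryImpl; `Fs`, `SortedIndex`, `addNoDuplicates` and `checksForEqualKeys` are hypothetical names made up for this example. The assumption is that a duplicate check on a sorted index has to walk every entry whose sort keys compare equal to the new item before it can decide that the same object is not already present.

```java
import java.util.ArrayList;
import java.util.List;

public class DupCheckSketch {

    /** Hypothetical stand-in for a feature structure: index keys plus a payload. */
    static final class Fs {
        final int begin;
        final int end;
        final String key;

        Fs(int begin, int end, String key) {
            this.begin = begin;
            this.end = end;
            this.key = key;
        }
    }

    /** Toy sorted index keyed by (begin ascending, end descending). */
    static final class SortedIndex {
        private final List<Fs> items = new ArrayList<>();
        long identityChecks = 0;

        static int compareKeys(Fs a, Fs b) {
            int c = Integer.compare(a.begin, b.begin);
            return c != 0 ? c : Integer.compare(b.end, a.end);
        }

        /** Binary search for the first position whose key is not less than fs's key. */
        private int lowerBound(Fs fs) {
            int lo = 0;
            int hi = items.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (compareKeys(items.get(mid), fs) < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            return lo;
        }

        /** Adds fs unless the very same object is already indexed. */
        boolean addNoDuplicates(Fs fs) {
            int i = lowerBound(fs);
            // Walk the run of entries with equal keys, checking object identity.
            while (i < items.size() && compareKeys(items.get(i), fs) == 0) {
                identityChecks++;
                if (items.get(i) == fs) {
                    return false; // already indexed, skip the add
                }
                i++;
            }
            items.add(i, fs); // insert behind the run of equal keys
            return true;
        }

        int size() {
            return items.size();
        }
    }

    /** Adds n distinct items that all share the same begin/end and counts the checks. */
    static long checksForEqualKeys(int n) {
        SortedIndex index = new SortedIndex();
        for (int k = 0; k < n; k++) {
            index.addNoDuplicates(new Fs(0, 100, "key" + k));
        }
        return index.identityChecks;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 2000, 4000}) {
            System.out.println(n + " adds -> " + checksForEqualKeys(n) + " identity checks");
        }
    }
}
```

In this toy model, n items with identical begin/end cost n(n-1)/2 identity checks in total, so doubling the number of items roughly quadruples the work. That matches the quadratic behaviour reported above, though the actual cost in UIMA depends on its real index implementation.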
