uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Performance and ALLOW_DUP_ADD_TO_INDEXES
Date Thu, 07 Jan 2016 01:38:57 GMT
Thanks for the information!

It does look like there's a performance issue when duplicate adds are not
allowed, for exactly the use case you mentioned: lots of FSs which compare
"equal" according to the sorted index keys, but which are not the same FS.
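
To make the cost concrete, here is a minimal sketch of how such a check can
degrade. It is not the UIMA code: DupCheckSketch, Item, SortedIndex and
countIdentityChecks are hypothetical names, and the index is just a sorted list
with a binary search followed by an identity scan over the key-equal entries.

```java
import java.util.ArrayList;
import java.util.List;

public class DupCheckSketch {
    /** Hypothetical stand-in for an FS; only the sort keys matter here. */
    static final class Item {
        final int begin;
        final int end;
        Item(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Simplified key comparison: begin, then end (both ascending). */
    static int compareKeys(Item a, Item b) {
        int c = Integer.compare(a.begin, b.begin);
        return c != 0 ? c : Integer.compare(a.end, b.end);
    }

    /** Sorted list that refuses to add the same object twice. */
    static final class SortedIndex {
        private final List<Item> items = new ArrayList<>();
        long identityChecks = 0;

        boolean add(Item fs) {
            // Binary search for the first entry whose key is >= the key of fs.
            int lo = 0;
            int hi = items.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (compareKeys(items.get(mid), fs) < 0) lo = mid + 1; else hi = mid;
            }
            // Scan the run of key-equal entries looking for the same object.
            int i = lo;
            while (i < items.size() && compareKeys(items.get(i), fs) == 0) {
                identityChecks++;
                if (items.get(i) == fs) return false; // already indexed
                i++;
            }
            items.add(i, fs);
            return true;
        }
    }

    /** Adds n distinct items and reports how many identity checks were made. */
    public static long countIdentityChecks(int n, boolean sameKeys) {
        SortedIndex index = new SortedIndex();
        for (int k = 0; k < n; k++) {
            index.add(sameKeys ? new Item(0, 100) : new Item(k, k + 1));
        }
        return index.identityChecks;
    }

    public static void main(String[] args) {
        System.out.println(countIdentityChecks(1000, false)); // prints 0
        System.out.println(countIdentityChecks(1000, true));  // prints 499500
    }
}
```

In this sketch, distinct keys cost no identity checks at all, while n items
sharing one key cost n(n-1)/2 checks in total, which is the quadratic behavior
described in the quoted message.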

This can be fixed I think. 

There's also a user-centered workaround, for some cases, when it's possible to
re-define the type system somewhat.

One thing I've seen frequently is that people define types that have nothing
to do with Annotation (e.g. they don't use begin / end) as subtypes of
Annotation.

If you are able to change the type system definitions so that these things no
longer are subtypes of Annotation, then the problem might go away.

On 1/6/2016 6:18 PM, Richard Eckart de Castilho wrote:
>>> I am starting to get suspicious of global flags for backwards compatibility.
>>> E.g. since ALLOW_DUP_ADD_TO_INDEXES was introduced, we have people complaining
>>> about a performance drop. ALLOW_DUP_ADD_TO_INDEXES can only be enabled/disabled
>>> globally, but not specifically for individual indexes. Neither can it be 
>>> temporarily disabled, e.g. during deserialization or other bulk operations.
>>> I wonder if local getters/setters or ThreadLocal variables initialized by
>>> a global setting wouldn't be a more appropriate option.
>> I was unaware of the performance issue; I may have missed some emails...  Can
>> you say how significant it is?  If there were no performance issue, would the
>> additional function be needed?
>> I assume the performance drop is when duplicates are not allowed (the new
>> default), and some users are wanting to restore the previous performance by
>> turning on ALLOW_DUP ....  Is this correct?
> I didn't track it in detail, but apparently, some time back Peter noticed a
> drop in XMI deserialization performance and more recently also in compressed
> binary CAS deserialization. Some time later, I had a person claiming in
> private mail that deserialization was O(n^2) with respect to the CAS size.
> At that point, I had a look at the code and it appears that in the worst
> case, the duplication check degrades to a linear CAS scan
> (cf. FSIndexRepositoryImpl line 98ff and FSIntArrayIndex line 101ff).
> That would happen if the CAS contains only items that are equal with respect
> to the index criteria, but not actually equal.
> Consider a hypothetical annotation type:
> Metadata extends Annotation {
>   String key;
>   String value;
> }
> where the begin/end are always set to 0..documentLength() and
> key/value have arbitrary values. I didn't try it, but if I 
> understood the code correctly, a CAS containing only such
> annotations would suffer heavily during addToIndexes().
> Cheers,
> -- Richard
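
The ThreadLocal idea raised in the quoted message could look roughly like the
sketch below. Everything in it is an assumption made for illustration: the
class DupAddSetting, the property name example.allow_dup_add_to_indexes and the
method withAllowed are hypothetical and are not part of UIMA.

```java
import java.util.function.Supplier;

public class DupAddSetting {
    // Hypothetical property name, read once as the global default.
    static final String PROP = "example.allow_dup_add_to_indexes";
    private static final boolean GLOBAL_DEFAULT = Boolean.getBoolean(PROP);

    // Each thread starts from the global default and may override it locally.
    private static final ThreadLocal<Boolean> ALLOW =
            ThreadLocal.withInitial(() -> GLOBAL_DEFAULT);

    /** Current setting for the calling thread. */
    public static boolean isAllowed() {
        return ALLOW.get();
    }

    /** Runs body with the setting overridden, then restores the previous value. */
    public static <T> T withAllowed(boolean allow, Supplier<T> body) {
        boolean previous = ALLOW.get();
        ALLOW.set(allow);
        try {
            return body.get();
        } finally {
            ALLOW.set(previous);
        }
    }

    public static void main(String[] args) {
        // A bulk operation such as deserialization could be wrapped like this.
        boolean inside = withAllowed(true, DupAddSetting::isAllowed);
        System.out.println(inside + " " + isAllowed());
    }
}
```

With a scheme like this, a deserializer could skip the duplicate check for the
duration of a bulk load on its own thread without changing the behavior seen by
other threads or by later code on the same thread.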
