uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Performance and ALLOW_DUP_ADD_TO_INDEXES
Date Thu, 07 Jan 2016 14:12:54 GMT
Thanks for explaining this "use case". 

I was a bit unclear on the two instances of deserialization time. 
One (the 70%) was xmi, the other (2%) was S+.  From reading the email chain, it
seems S+ is the "CasCompleteSerializer".  This switches to plain binary mode. 
So you would avoid the XML parsing overhead. 

But I think both deserializations would have the same issue around "allow_dups"
if that was where most of the slowdown came from, since both would add all
those annotations to the index.  Perhaps that was another use case though...
Am I mixing these up?


On 1/7/2016 2:38 AM, Peter Klügl wrote:
> Hi,
> rule engineering in ruta follows a paradigm where you spam annotations,
> also with same type and offsets. Additionally, and probably the most
> critical case, the explanation of the rule inference creates those
> annotations for debugging. Imagine something like an annotation of the
> type "Debug" for every text span where some rule tried to match (and did
> not even succeed).
> In one of my use cases, the time spent in deserialization in the CAS
> Editor was reduced from 70% (xmi) to 2% (binary).
> Here's the discussion:
> https://issues.apache.org/jira/browse/UIMA-4685
> Best,
> Peter
> On 07.01.2016 at 02:38, Marshall Schor wrote:
>> Thanks for the information!
>> It does look like there's a performance issue if not allowing duplicate adds, for
>> exactly the use case you mentioned: lots of FSs which compare "equal" according
>> to the sorted index keys, but which are not the same FS.
>> This can be fixed I think. 
>> There's also a user-centered workaround, for some cases, when it's possible to
>> re-define the type system somewhat.
>> One kind of thing I've seen frequently, is that people define types having
>> nothing to do with Annotation (e.g. they don't use begin / end, etc.) as
>> subtypes of Annotation.
>> If you are able to change the type system definitions so that these things no
>> longer are subtypes of Annotation, then the problem might go away.
>> -Marshall
>> On 1/6/2016 6:18 PM, Richard Eckart de Castilho wrote:
>>>>> I am starting to get suspicious of global flags for backwards compatibility.
>>>>> E.g. since ALLOW_DUP_ADD_TO_INDEXES was introduced, we have people complaining
>>>>> about a performance drop. ALLOW_DUP_ADD_TO_INDEXES can only be enabled/disabled
>>>>> globally, but not specifically for individual indexes. Neither can it be
>>>>> temporarily disabled, e.g. during deserialization or other bulk operations.
>>>>> I wonder if local getters/setters or ThreadLocal variables initialized from
>>>>> a global setting wouldn't be a more appropriate option.
>>>> I was unaware of the performance issue; I may have missed some emails...
>>>> Can you say how significant it is?  If there were no performance issue, would
>>>> additional function be needed?
>>>> I assume the performance drop is when duplicates are not allowed (the new
>>>> default), and some users are wanting to restore the previous performance by
>>>> turning on ALLOW_DUP ....  Is this correct?
>>> I didn't track it in detail, but apparently, some time back Peter noticed a
>>> drop in XMI deserialization performance and more recently also in compressed
>>> binary CAS deserialization. Some time later, I had a person claiming in
>>> private mail that deserialization was O(n^2) with respect to the CAS size.
>>> At that point, I had a look at the code and it appears that in the worst
>>> case, the duplication check degrades to a linear CAS scan
>>> (cf. FSIndexRepositoryImpl line 98ff and FSIntArrayIndex line 101ff).
>>> That would happen if the CAS contains only items that are equal with respect
>>> to the index criteria, but not actually equal. 
>>> Consider a hypothetical annotation type:
>>> Metadata extends Annotation {
>>>   String key;
>>>   String value;
>>> }
>>> where the begin/end are always set to 0..documentLength() and
>>> key/value have arbitrary values. I didn't try it, but if I 
>>> understood the code correctly, a CAS containing only such
>>> annotations would suffer heavily during the addToIndexes().
>>> Cheers,
>>> -- Richard
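
The worst case Richard describes (many FSs that compare equal on the index keys but are distinct objects) can be illustrated with a toy sorted index. This is only a minimal sketch under stated assumptions, not the UIMA implementation: the classes `Fs` and `DupCheckSketch`, the `scanSteps` counter, and the method `scanStepsFor` are all hypothetical names invented for this example. It assumes the duplicate check locates the run of key-equal entries by binary search and then walks that run looking for the same object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feature structure; not a UIMA class.
final class Fs {
    final int begin;
    final int end;
    final String key;

    Fs(int begin, int end, String key) {
        this.begin = begin;
        this.end = end;
        this.key = key;
    }
}

// Toy sorted index with a "no duplicate adds" check (assumed behavior, not UIMA code).
final class DupCheckSketch {
    private final List<Fs> sorted = new ArrayList<>();
    long scanSteps = 0; // entries visited by the duplicate check

    // Annotation-style ordering: begin ascending, then end descending.
    // The "key" field is deliberately NOT part of the ordering.
    private static int compare(Fs a, Fs b) {
        if (a.begin != b.begin) {
            return Integer.compare(a.begin, b.begin);
        }
        return Integer.compare(b.end, a.end);
    }

    // Binary search for the first position whose entry is not less than fs.
    private int lowerBound(Fs fs) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(sorted.get(mid), fs) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Adds fs unless the very same object is already indexed.
    boolean addToIndexes(Fs fs) {
        int pos = lowerBound(fs);
        // Walk every entry that compares equal on the index keys.
        while (pos < sorted.size() && compare(sorted.get(pos), fs) == 0) {
            scanSteps++;
            if (sorted.get(pos) == fs) {
                return false; // identical FS already present
            }
            pos++;
        }
        sorted.add(pos, fs);
        return true;
    }

    // Builds an index of n distinct objects and reports the total scan work.
    static long scanStepsFor(int n, boolean sameSpan) {
        DupCheckSketch index = new DupCheckSketch();
        for (int k = 0; k < n; k++) {
            int begin = sameSpan ? 0 : k;
            int end = sameSpan ? 100 : k + 1;
            index.addToIndexes(new Fs(begin, end, "key" + k));
        }
        return index.scanSteps;
    }

    public static void main(String[] args) {
        System.out.println("distinct spans: " + scanStepsFor(2000, false));
        System.out.println("same span:      " + scanStepsFor(2000, true));
    }
}
```

In this toy model, adding n objects that all share one span costs n(n-1)/2 scan steps in total, while n objects with distinct, increasing spans cost none. That matches the O(n^2) behavior reported in the thread, but only under the assumptions stated above.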

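Richard's suggestion of a ThreadLocal setting that falls back to a global default could look roughly like the following. This is a hedged sketch of the idea only: `DupAddPolicy`, `setGlobal`, `allowDup`, and `withAllowDup` are hypothetical names and do not exist in UIMA.

```java
import java.util.function.Supplier;

// Hypothetical flag holder; none of these names exist in UIMA.
final class DupAddPolicy {
    // Global default, standing in for a setting such as ALLOW_DUP_ADD_TO_INDEXES.
    private static volatile boolean globalAllowDup = false;

    // Per-thread override; null means "use the global default".
    private static final ThreadLocal<Boolean> OVERRIDE = new ThreadLocal<>();

    static void setGlobal(boolean allow) {
        globalAllowDup = allow;
    }

    // Effective setting for the calling thread.
    static boolean allowDup() {
        Boolean local = OVERRIDE.get();
        return local != null ? local : globalAllowDup;
    }

    // Runs body with a temporary per-thread setting, then restores the previous state.
    static <T> T withAllowDup(boolean allow, Supplier<T> body) {
        Boolean previous = OVERRIDE.get();
        OVERRIDE.set(allow);
        try {
            return body.get();
        } finally {
            if (previous == null) {
                OVERRIDE.remove();
            } else {
                OVERRIDE.set(previous);
            }
        }
    }
}
```

A bulk operation such as deserialization could then wrap its work in `DupAddPolicy.withAllowDup(true, ...)`, leaving other threads and later calls on the same thread on the global default.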