uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: CasIOUtils class - some meta-questions
Date Thu, 11 Aug 2016 19:17:40 GMT
Re: skipping re-reading the separate TSI file info: OK, makes sense.

One thing I wonder, in terms of "useless work avoiding":

If you're in a loop like this:

  casMgr = read(casMgrFile);
  for (file in directory) {
    load(file, casMgr, CAS, boolean)
  }

it seems it would make more sense to read the TSI info once and use it to set
up the CAS once (build/commit the type system, create the index repositories,
etc.); the load loop would then run potentially significantly faster.

That should be trivial to do for a CAS - e.g. adding 1 line to your example set
up above:

casMgr = read(casMgrFile);
casImpl.setupCasFromCasMgrSerializer(casImpl, casMgr);

If this seems better to you than an API that takes an instance of a (saved)
casMgr, then I'd like to leave the load variant with the CASManagerSerializer
argument out of the CasIOUtils API.
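
To make the "set up once, then loop" idea concrete, here is a minimal sketch in
plain Java. It does not use any UIMA classes: TypeSystemInfo, FakeCas and
loadAll are hypothetical stand-ins (for the saved TSI, the CAS, and the loop
above), and the counters exist only to show that setup runs once however many
files are loaded.

```java
import java.util.Arrays;
import java.util.List;

public class SetupOnceSketch {

    /** Hypothetical stand-in for the saved type system and index info (TSI). */
    public static final class TypeSystemInfo {
        final String[] typeNames;

        public TypeSystemInfo(String... typeNames) {
            this.typeNames = typeNames.clone();
        }
    }

    /** Hypothetical stand-in for a CAS; counts setups and loads. */
    public static final class FakeCas {
        public int setupCount = 0;
        public int loadCount = 0;
        private String[] typeNames;

        /** The expensive step: install the type system and indexes. */
        public void setup(TypeSystemInfo tsi) {
            this.typeNames = tsi.typeNames.clone();
            setupCount++;
        }

        /** The cheap step: load one file into the already set-up CAS. */
        public void load(String fileName) {
            if (typeNames == null) {
                throw new IllegalStateException("CAS not set up");
            }
            loadCount++;
        }
    }

    /** Sets up the CAS once from the TSI, then loads every file into it. */
    public static FakeCas loadAll(TypeSystemInfo tsi, List<String> fileNames) {
        FakeCas cas = new FakeCas();
        cas.setup(tsi); // done once, outside the loop
        for (String fileName : fileNames) {
            cas.load(fileName); // no TSI handling here
        }
        return cas;
    }

    public static void main(String[] args) {
        FakeCas cas = loadAll(new TypeSystemInfo("Token", "Sentence"),
                Arrays.asList("a.bin", "b.bin", "c.bin"));
        // prints "setups=1, loads=3"
        System.out.println("setups=" + cas.setupCount + ", loads=" + cas.loadCount);
    }
}
```

With this shape the load call needs no serializer argument at all, which is the
reason for wanting to drop that variant from the API.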


On 8/11/2016 1:59 PM, Richard Eckart de Castilho wrote:
> On 11.08.2016, at 19:43, Marshall Schor <msa@schor.com> wrote:
>> I'm working on this now.
>> I note that the new load(InputStream, CasMgrSerializer, CAS, boolean) method is
>> "public".  Is there some code (perhaps in DKPro) that needs this form?
>> If not, I'll remove this method and make the reading to create the
>> CasMgrSerializer "lazy" - not done until needed.
> Yep, I need something like that in DKPro.
> When the type system information is stored outside the binary CAS in a
> separate file, that TSI file would have to be re-read for every CAS file.
> Being able to pass the CasMgrSerializer to load() allows me to read it only
> once.
>> Not sure about zipping the type system - we have 3 choices, perhaps: 1) nothing,
>> 2) zip, 3) custom compression zip (like the rest of form 6).
>> I'm leaning toward doing this work (if ever done) later.
> I've been pushing that ahead since implementing the BinaryCasReader/Writer :)
> Probably doesn't hurt if it gets pushed ahead a bit further.
> I had a quick look at the CasMgrSerializer - you called it highly inefficient.
> It doesn't look that inefficient. At least it uses primitive and String arrays
> and not collections :)
>> ================
>> I have one more question - there's a comment which I don't see implemented -
>> which says that when a set of deserializations are being done with the same type
>> system, the extra work to handle the type system is only done once:
>>   * This method avoids the repeated loading of the typesystem and index definitions
>>   * from a stream when loading many CASes in a row.
>> How do you think that should be implemented?
> Well, that's happening when I read the CasMgrSerializer from a separate file - as
> explained above:
>   casMgr = read(casMgrFile)
>   for (file in directory) {
>     load(file, casMgr, CAS, boolean)
>   }
> Cheers,
> -- Richard
