uima-dev mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: opinion on degree of backwards compatibility for Uima V3 experiment
Date Fri, 09 Sep 2016 09:38:02 GMT
I think the best way to answer this question is to write a few simple, fully
working examples that use UIMA 2 with different Hadoop frameworks
(e.g. MapReduce, Spark) and see how we can make UIMA a pleasure to use
with them.

I sketched out some Spark code that shows how I would like to use UIMA.
I think that today this is much more complex, and some things are either not
possible or not fast (the CAS is not designed to be immutable).

  public static void main(String[] args) {

    TypeSystem ts = ...

    JavaSparkContext sc = new JavaSparkContext();
    JavaRDD<String> texts = sc.textFile("hdfs://...");

    JavaRDD<CAS> docs = texts.map(new Function<String, CAS>() {
      @Override
      public CAS call(String text) throws Exception {
        // create CAS from shared type system
        // add the text to the cas
      }
    });

    AnalysisEngine tokenizer = ...

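    // Wished-for API: in UIMA 2, process(CAS) updates the CAS in place and
    // returns a ProcessTrace rather than a CAS, so this line is a sketch.
    docs = docs.map(cas -> tokenizer.process(cas));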

    // Placeholder: look the token type up in the type system in real code.
    Type tokenType = null;
    JavaRDD<Integer> counts = docs.map(new Function<CAS, Integer>() {
      @Override
      public Integer call(CAS cas) throws Exception {
        // Count the token annotations in this document.
        int tokenCount = 0;
        for (FeatureStructure fs : cas.getAnnotationIndex(tokenType)) {
          tokenCount++;
        }
        return tokenCount;
      }
    });

    counts.reduce((a, b) -> a + b);
  }
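
To try out the shape of this pipeline (map each document to a token count,
then reduce by summing) without a cluster, here is a minimal stand-in that
uses only java.util.stream. It rests on assumptions made purely for
illustration: a whitespace split replaces the UIMA tokenizer, an in-memory
list replaces the HDFS file, and the class and method names are hypothetical.
None of it is UIMA or Spark API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the Spark sketch; no UIMA or Spark involved. */
final class TokenCountSketch {

  /** Stand-in for the tokenizer: counts whitespace-separated tokens. */
  static int countTokens(String text) {
    String trimmed = text.trim();
    return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
  }

  /** Maps each document to its token count, then reduces by summing. */
  static int totalTokens(List<String> texts) {
    return texts.stream()
        .map(TokenCountSketch::countTokens)
        .reduce(0, Integer::sum);
  }

  public static void main(String[] args) {
    List<String> texts = Arrays.asList("UIMA on Spark", "one two", "");
    System.out.println(totalTokens(texts)); // prints 5
  }
}
```

The map step here is a pure function from text to count, which is what makes
it easy to distribute. A mutable CAS passed from one stage to the next does
not have that property.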

Jörn




On Wed, Sep 7, 2016 at 3:45 PM, Marshall Schor <msa@schor.com> wrote:

> Hi Jörn,
>
> Thanks for your input.  Could you possibly expand with a few specifics on
> what changes you think would make it easier to use with Hadoop etc.?
>
> -Marshall
>
>
> On 9/7/2016 7:46 AM, Joern Kottmann wrote:
> > Hello all,
> >
> > at my workplace we use UIMA mostly with custom code to load data into a
> > pipeline and store its results, so we don't depend on the UIMA
> > serialization formats at all. Changing them, or adding new incompatible
> > ones, wouldn't be an issue for us. The existing code can also be ported
> > to work with UIMA 3.
> >
> > I really hope we can get UIMA 3 into a shape where it is easier to use
> > with today's requirements and possibilities (e.g. with Hadoop).
> >
> > I personally think that the effort to create the next overhauled version
> > shouldn't be limited in any way by backward compatibility.
> > For me, a good solution would be some help with migrating to UIMA 3
> > (e.g. a guide that explains what to do) and maybe maintaining UIMA 2 in
> > parallel for a while (e.g. fixes for very urgent/critical bugs).
> >
> > Jörn
> >
> > On Fri, Sep 2, 2016 at 7:56 PM, Richard Eckart de Castilho
> > <rec@apache.org> wrote:
> >
> >> See comment at end of mail.
> >>
> >> On 02.09.2016, at 15:18, Marshall Schor <msa@schor.com> wrote:
> >>> To go from an ID to an FS is not generally possible, because normally
> >>> the framework doesn't keep this association.  There are exceptions
> >>> though, the main ones being:
> >>>
> >>> a) If you use low-level CAS APIs to create FSs, the API returns the ID,
> >>> which means that a GC that happens right after the API returns would
> >>> garbage collect the FS, because at that point nothing is "holding on"
> >>> to any reference (it's not in any index).  To prevent this, the
> >>> low-level create-FS methods add the FS to a map which goes from
> >>> ID -> FS, and thus "holds onto" the FS, preventing garbage collection.
> >>>
> >>> b) Another case where this happens is when PEARs are used; in this case
> >>> the FSs involved with PEAR "trampoline" FSs end up being in similar maps.
> >>>
> >>> Both of these approaches of course disable a feature of V3, namely that
> >>> unreferenced FSs can be garbage collected.
> >>>
> >>> ...
> >>>
> >>> There is an API in the V3 CASImpl, getFsFromId(int), and also
> >>> getFsFromId_checked(int), which retrieves the associated FS, given the
> >>> ID, or returns null (or throws an exception) if it isn't in the table.
> >>> Most FSs created normally won't be in the table.
> >> Can we do this? -> As soon as an FS has been added to an index or is
> >> being referenced from another FS, its ID should be resolvable to the
> >> respective FS.
> >>
> >> When an FS is in an index or being referred to by another FS, it cannot
> >> be garbage collected anyway. The CAS could maintain a lookup using weak
> >> references to provide a central place to look up such FSes via their
> >> IDs without preventing garbage collection.
> >>
> >> WebAnno remembers the ID of every FS rendered on screen. When the user
> >> performs an action, we load the CAS from disk and then look up the ID to
> >> retrieve the FS. We do not keep the CAS in memory all the time. If we
> >> had to scan the whole CAS for the FS with a given ID, it would probably
> >> have a serious performance impact.
> >>
> >> Cheers,
> >>
> >> -- Richard
>
>
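
The weak-reference lookup that Richard suggests in the quoted mail could look
roughly like the sketch below. It is only an illustration under stated
assumptions: WeakIdLookup and its register/get/getChecked methods are
hypothetical names and not UIMA API. The real V3 methods mentioned in the
thread are getFsFromId(int) and getFsFromId_checked(int), and nothing here
claims to reproduce how they are implemented.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical ID -> object table that does not keep its entries alive. */
final class WeakIdLookup<T> {

  /** Weak reference that remembers the ID it was registered under. */
  private static final class Entry<T> extends WeakReference<T> {
    final int id;

    Entry(int id, T referent, ReferenceQueue<T> queue) {
      super(referent, queue);
      this.id = id;
    }
  }

  private final Map<Integer, Entry<T>> table = new HashMap<>();
  private final ReferenceQueue<T> collected = new ReferenceQueue<>();

  /** Call when an object is indexed or referenced from another object. */
  synchronized void register(int id, T value) {
    purge();
    table.put(id, new Entry<>(id, value, collected));
  }

  /** Returns the object for the ID, or null if unknown or already collected. */
  synchronized T get(int id) {
    purge();
    Entry<T> entry = table.get(id);
    return entry == null ? null : entry.get();
  }

  /** Like get, but throws instead of returning null. */
  synchronized T getChecked(int id) {
    T value = get(id);
    if (value == null) {
      throw new NoSuchElementException("No live object for ID " + id);
    }
    return value;
  }

  /** Drops table entries whose referents the garbage collector has cleared. */
  private void purge() {
    Entry<?> stale;
    while ((stale = (Entry<?>) collected.poll()) != null) {
      // Remove only if the ID still maps to this exact (stale) entry.
      table.remove(stale.id, stale);
    }
  }

  public static void main(String[] args) {
    WeakIdLookup<StringBuilder> lookup = new WeakIdLookup<>();
    StringBuilder indexed = new StringBuilder("Token[0,5]");
    lookup.register(42, indexed);
    System.out.println(lookup.get(42) == indexed); // prints true
    System.out.println(lookup.get(99));            // prints null
  }
}
```

Because the table holds only weak references, an object stays resolvable for
as long as something else (an index or another object) references it, and the
table by itself never prevents garbage collection. A strong ID -> FS map, as
described for the low-level create methods, would pin every registered object.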
