uima-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: UIMA and BSP
Date Thu, 17 May 2012 12:10:36 GMT
Hi Jens,

2012/5/17 Jens Grivolla <j+asf@grivolla.net>

> Hi Tommaso,
> as I understand it each CAS is processed independently and without
> parallelization, right? If so, what you are doing does not look that much
> like MapReduce (since you don't reduce) but is closer to just running many
> parallel instances on subsets of the collection.

there is no Reduce phase as I didn't want to implement a MR algorithm, but
just a BSP (the only "reduce-like" phase in that implementation is the
trivial one where the ProcessTraces are collected and written to file).
BTW: you may find the BSP vs MapReduce paper [1] interesting.

> We are currently using Sun Grid Engine to launch CPE instances on several
> nodes, getting the input data (in plain text or XMI format) from a MySQL
> database and writing XMI output to the DB. That way we avoid
> synchronization issues and can distribute data between instances with the
> simple modulo trick in the SELECT query.

that sounds like a good approach as well.

> We also tried using UIMA AS, but the overhead seemed very big. Maybe by
> just having fully colocated aggregates, each working on one CAS from
> beginning to end it wouldn't be too bad, then we would just have one
> central CollectionReader that dispatches to the different aggregates. You
> don't seem to parallelize within the processing flow, so that's quite close
> to what your example does, isn't it?

Yes, it is, my sample implementation works like this:

1. first superstep:
1a. computation: each node instantiates and initializes an AE
1b. communication: a "master" node sends the docs to the other nodes
2. second superstep
2a. computation: each node process all the docs received with the AE
2b. communication: each node sends the processing results to the "master"
3. third superstep:
3a. computation: the "master" collects the received results and all the
nodes deallocate resources
3b. communication: nothing

The good thing is that this can be run locally with multi core processors
to execute #docs%#processors parallel processes but, once the input is
placed in HDFS rather than a local directory, also with remote machines
without changing a single line of code (just configuring the Hama cluster,
see [2]).
For example, on my computer each of the 4 CPUs runs a different job, so
during the analysis phase 4 documents/CASes at a time are processed in

However this is just a sample implementation which is not supposed to be
production ready, maybe just a starting point for exploring if we can use
this BSP based model to create new implementations of CPE/CPM and compare
them with other UIMA distributed architectures (UIMA-AS, your SGE solution,


[1] : http://arxiv.org/pdf/1203.2081.pdf
[2] : http://wiki.apache.org/hama/GettingStarted#Modes

> Bye,
> Jens
> On 05/17/2012 09:25 AM, Tommaso Teofili wrote:
>> Hi all,
>> recently I've been playing (and coding) with BSP [1] based algorithms
>> using
>> Apache Hama [2] (which officially graduated to TLP yesterday) and I found
>> that in many cases there were significant performance boosts with respect
>> to a "plain" MapReduce based algorithm, so I thought it would have made
>> sense to write a UIMA collection processing algorithm using Hama.
>> I started sketching it up on a sample project on GitHub [3] but I think it
>> would make sense to put it on our sandbox so that anyone can have a
>> look/use/improve/evaluate it.
>> The current implementation I have just reads files from a directory inside
>> the filesystem, process them in parallel and collects the ProcessTraces
>> inside an output file but my idea is that it may come just as a new CPM
>> implementation reading and writing from/to HDFS.
>> I know it's a lot of things in few lines so feel free to ask for more
>> clarifications.
>> Have a nice day,
>> Tommaso
>> [1] : http://en.wikipedia.org/wiki/**Bulk_synchronous_parallel<http://en.wikipedia.org/wiki/Bulk_synchronous_parallel>
>> [2] : http://incubator.apache.org/**hama<http://incubator.apache.org/hama>
>> [3] :
>> https://github.com/tteofili/**samplett/blob/master/uima-bsp/**
>> src/main/java/com/github/**samplett/uima/bsp/**AEProcessingBSPJob.java<https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java>

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message