uima-dev mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject Re: UIMA and BSP
Date Thu, 17 May 2012 09:37:34 GMT
Hi Tommaso,

As I understand it, each CAS is processed independently, with no
parallelization within a single CAS, right? If so, what you are doing
does not look much like MapReduce (since there is no reduce step) and is
closer to running many parallel instances on subsets of the collection.

We are currently using Sun Grid Engine to launch CPE instances on 
several nodes, getting the input data (in plain text or XMI format) from 
a MySQL database and writing XMI output to the DB. That way we avoid 
synchronization issues and can distribute data between instances with 
the simple modulo trick in the SELECT query.
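
To make the modulo trick concrete, here is a minimal sketch. It is not
our actual setup: the table name `documents`, the integer `id` column,
and both function names are made up for illustration, and it uses an
in-memory SQLite database instead of MySQL so that it runs on its own.
Each instance gets an index k and the total instance count n, and only
selects the rows where id % n = k.

```python
import sqlite3


def make_demo_db(doc_count):
    """Build a throwaway in-memory table standing in for the real input DB."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: an integer key plus the document text.
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany(
        "INSERT INTO documents (id, text) VALUES (?, ?)",
        [(i, f"document {i}") for i in range(1, doc_count + 1)],
    )
    return conn


def fetch_partition(conn, instance_index, instance_count):
    """Return the ids of the documents assigned to one instance.

    Every id leaves exactly one remainder when divided by instance_count,
    so the partitions are disjoint and together cover the whole table.
    """
    rows = conn.execute(
        "SELECT id FROM documents WHERE id % ? = ? ORDER BY id",
        (instance_count, instance_index),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_demo_db(10)
    for k in range(3):
        print(k, fetch_partition(conn, k, 3))
```

Since no two instances ever select the same row, they need no locking
or coordination among themselves. The split is only as even as the ids
are, so large gaps in the id sequence can leave some instances with
less work.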

We also tried UIMA AS, but the overhead seemed very high. It might not
be too bad with fully colocated aggregates, each working on one CAS
from beginning to end; we would then just have one central
CollectionReader that dispatches to the different aggregates. You don't
seem to parallelize within the processing flow, so that's quite close
to what your example does, isn't it?
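
As a rough illustration of that layout (one central reader, several
workers that each take a document through the whole pipeline), here is
a small sketch. It does not use the UIMA or UIMA AS APIs at all: the
reader, the "aggregate" and its two steps are hypothetical stand-ins,
and a thread pool plays the role of the dispatcher.

```python
from concurrent.futures import ThreadPoolExecutor


def read_collection():
    """Stand-in for the central CollectionReader: yields one document at a time."""
    for doc_id in range(1, 7):
        yield doc_id, f"text of document {doc_id}"


def run_aggregate(item):
    """Stand-in for a colocated aggregate.

    All steps for one document run here, in one worker, from beginning
    to end; nothing is handed to another worker in between.
    """
    doc_id, text = item
    tokens = text.split()        # hypothetical step 1: tokenize
    return doc_id, len(tokens)   # hypothetical step 2: count the tokens


def process_collection(worker_count):
    """Dispatch each document from the single reader to one of the workers."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return dict(pool.map(run_aggregate, read_collection()))


if __name__ == "__main__":
    print(process_collection(3))
```

The only shared piece is the reader, so parallelism is across
documents and never inside the processing of a single one.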


On 05/17/2012 09:25 AM, Tommaso Teofili wrote:
> Hi all,
> recently I've been playing (and coding) with BSP [1] based algorithms using
> Apache Hama [2] (which officially graduated to TLP yesterday) and I found
> that in many cases there were significant performance boosts with respect
> to a "plain" MapReduce based algorithm, so I thought it would make
> sense to write a UIMA collection processing algorithm using Hama.
> I started sketching it up on a sample project on GitHub [3] but I think it
> would make sense to put it on our sandbox so that anyone can have a
> look/use/improve/evaluate it.
> The current implementation just reads files from a directory on the
> filesystem, processes them in parallel and collects the ProcessTraces
> in an output file, but my idea is that it could become a new CPM
> implementation reading from and writing to HDFS.
> I know that's a lot of things in a few lines, so feel free to ask for
> clarification.
> Have a nice day,
> Tommaso
> [1] : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
> [2] : http://incubator.apache.org/hama
> [3] : https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java
