uima-dev mailing list archives

From Renaud Richardet <renaud.richar...@gmail.com>
Subject Re: Flexibility of binary CAS serialization
Date Tue, 18 Dec 2012 15:47:54 GMT
Hi Erik,

I have been working on a UIMA module (using MongoDB) that could fit your needs:

When the same documents are processed repeatedly with different UIMA
pipelines, some pipeline steps (e.g. preprocessing) are duplicated
across runs. The MongoDB module lets you persist annotated documents,
resume their processing, and add new annotations to them. MongoDB is a
high-performance NoSQL document database. In the MongoDB module, every
CAS is stored as a document, along with its annotations. UIMA
annotations and their features are explicitly mapped to MongoDB fields
using a simple, declarative language. The mappings are used both when
persisting to and when loading from the database. The following UIMA
components are available:
• MongoCollectionReader reads CASes from a MongoDB collection.
Optionally, one can specify a (filter) query, e.g.
– {my_db_field:{$exists:true}} for the existence of a field;
– {pmid: 17} to query a specific PubMed document;
– {pmid:{$in:[12,17]}} to query a list of PubMed documents;
– {pmid:{ $gt: 8, $lt: 11 }} for a range of documents.
• RegexMongoCollectionReader is similar to MongoCollectionReader but
lets you specify a query with a regular expression on a specific
field.
• MongoWriter persists new UIMA CASes as MongoDB documents.
• MongoUpdateWriter persists new annotations into an existing MongoDB
document.
• MongoCollectionRemover removes selected annotation types from a
MongoDB collection.
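
To make the filter queries listed for MongoCollectionReader concrete, here is a minimal sketch in plain Python that evaluates the same four query shapes against in-memory dictionaries. It is not the module's code and does not talk to MongoDB; the `matches` and `select` helpers and the sample documents are assumptions made for illustration, and only the small subset of operators used above is handled.

```python
from typing import Any, Dict, List

def matches(doc: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if doc satisfies a small subset of MongoDB filter syntax.

    Hypothetical helper for illustration only: it handles plain equality,
    $exists, $in, $gt and $lt on top-level scalar fields.
    """
    for field, cond in query.items():
        present = field in doc
        value = doc.get(field)
        if not isinstance(cond, dict):
            # Plain value means equality, e.g. {"pmid": 17}
            if not present or value != cond:
                return False
            continue
        for op, arg in cond.items():
            if op == "$exists":
                ok = present == bool(arg)
            elif op == "$in":
                ok = present and value in arg
            elif op == "$gt":
                ok = present and value > arg
            elif op == "$lt":
                ok = present and value < arg
            else:
                raise ValueError(f"unsupported operator: {op}")
            if not ok:
                return False
    return True

# Made-up sample documents keyed by PubMed id; one carries an extra field.
docs: List[Dict[str, Any]] = [{"pmid": p} for p in (8, 9, 10, 11, 12, 17)]
docs[1]["my_db_field"] = "x"

def select(query: Dict[str, Any]) -> List[int]:
    """Return the pmids of all sample documents matching the query."""
    return [d["pmid"] for d in docs if matches(d, query)]

print(select({"my_db_field": {"$exists": True}}))
print(select({"pmid": 17}))
print(select({"pmid": {"$in": [12, 17]}}))
print(select({"pmid": {"$gt": 8, "$lt": 11}}))
```

Note that both range bounds are exclusive, so the last query selects only pmids 9 and 10 from the sample data.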
With the above components, it is possible within a single pipeline to
read an existing collection of annotated documents, perform some
further processing, add more annotations, and store these annotations
back into the same MongoDB documents. In terms of performance, the
MongoDB module has been tested with a corpus of PubMed abstracts
(approximately 22 million documents, throughput over 2000 docs/s) and a
corpus of several million full-text papers (throughput around 200
docs/s, bound by disk I/O). It is also possible to scale MongoDB
horizontally in a cluster setup, or to use SSDs to improve performance.
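
The read, annotate, and write-back cycle described above can be sketched with plain dictionaries standing in for stored documents. This is only an illustration under stated assumptions: the `MAPPING` table, the field names, and the helper functions are hypothetical and do not reproduce the module's actual mapping language or API.

```python
import copy
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int]

# Hypothetical mapping from UIMA annotation type to document field name.
MAPPING = {"Sentence": "sentences", "Protein": "proteins"}

def spans_to_field(spans: List[Span]) -> List[Dict[str, int]]:
    """Encode (begin, end) offsets as a list of small sub-documents."""
    return [{"begin": b, "end": e} for b, e in spans]

def cas_to_doc(pmid: int, text: str,
               annotations: Dict[str, List[Span]]) -> Dict[str, Any]:
    """Mimic a writer: one document per CAS, one field per annotation type."""
    doc: Dict[str, Any] = {"pmid": pmid, "text": text}
    for type_name, spans in annotations.items():
        doc[MAPPING[type_name]] = spans_to_field(spans)
    return doc

def add_annotations(doc: Dict[str, Any], type_name: str,
                    spans: List[Span]) -> Dict[str, Any]:
    """Mimic an update writer: add one annotation field, keep the rest."""
    updated = copy.deepcopy(doc)
    updated[MAPPING[type_name]] = spans_to_field(spans)
    return updated

# First run: preprocessing stores the text with sentence annotations.
text = "BDNF binds TrkB."
stored = cas_to_doc(17, text, {"Sentence": [(0, 16)]})

# Later run: resume from the stored document and add protein annotations
# without repeating the preprocessing step.
resumed = add_annotations(stored, "Protein", [(0, 4), (11, 15)])
covered = [text[a["begin"]:a["end"]] for a in resumed["proteins"]]
print(covered)
```

Keeping each annotation type in its own field is what allows a later run to add or remove one type without touching the others; whether the module lays documents out exactly this way is an assumption.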

Let me know if you are interested. I plan to release the code soon.


-- 
Renaud Richardet
Blue Brain Project  PhD candidate
EPFL  Station 15
CH-1015 Lausanne
phone: +41-78-675-9501
http://people.epfl.ch/renaud.richardet
