ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: Questions about dictionary-lookup and dictionary-lookup-fast
Date Tue, 10 Mar 2015 16:53:43 GMT
Hi Maite,

> Does anyone know why it [UmlsDictionaryLookupAnnotator] is so slow?
The top 5 reasons (1-3 are 90% of the problem):
1.  The dictionary database is bloated with unwanted entries
2.  The dictionary database indexing is sub-optimal
3.  The second drug lookup with orangebook filtering takes extra time
4.  The matching algorithm does a little more work than is necessary
5.  There is some redundancy

> my interest is to be able to create my own HsqlDb-based dictionary
If you want to build a database using a subset of UMLS, check out the Dictionary Tool in Sandbox.
It can build custom hsqldb dictionaries in both the new (-fast) and old formats, using the sources,
TUIs, filters, etc. that you specify in plaintext parameter files.  Several types of default
setups are already available.  It is fully functional, but it has been a work in progress
during my off-hours, so functionality still changes and documentation is lacking.  There is a
howto.txt in the dictionarytool/doc/ directory.

*NOTE: if your custom dictionaries are small (~1000 entries?), it would probably be easier
to just put them in a bar-separated value (bsv) file.  There are examples in the dictionary-fast-res
example/bsv/ directory.
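
As a rough illustration of the bsv idea, here is a minimal Python sketch that writes a few entries as bar-separated lines. The column order (CUI, TUI, term text), the placeholder codes, and the file name custom_terms.bsv are all assumptions made for this example, not the confirmed layout; check the example files mentioned above for the format the lookup actually expects.

```python
# Hypothetical entries: (CUI, TUI, term text). The codes are placeholders and
# the column order is an assumption; compare with the shipped bsv examples.
ENTRIES = [
    ("C0000001", "T047", "example disorder"),
    ("C0000002", "T121", "example drug"),
]


def to_bsv(entries):
    """Render rows as bar-separated lines, one entry per line."""
    lines = []
    for row in entries:
        for field in row:
            # A bar or newline inside a field would corrupt the file layout.
            if "|" in field or "\n" in field:
                raise ValueError(f"field contains a separator: {field!r}")
        lines.append("|".join(field.strip() for field in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical output file name.
    with open("custom_terms.bsv", "w", encoding="utf-8") as handle:
        handle.write(to_bsv(ENTRIES))
```

Generating the file from a script like this, rather than editing it by hand, makes it easier to catch stray separators before the dictionary is loaded.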


-----Original Message-----
From: Maite Meseure Hugues [mailto:meseure.maite@gmail.com] 
Sent: Tuesday, March 10, 2015 12:35 PM
To: dev@ctakes.apache.org
Subject: Questions about dictionary-lookup and dictionary-lookup-fast

Hi everyone,

1) I am currently working on BagOfCuisGenerator.java with the analysis engine 'AggregatePlaintextUMLSProcessor.xml',
but that process is very slow at that step:

INFO UmlsDictionaryLookupAnnotator - process(JCas)

Does anyone know why it is so slow?

2) I also tried 'AggregatePlaintextFastUMLSProcessor.xml', and it is actually pretty fast,
as its name suggests. However, I would like to be able to create my own HsqlDb-based dictionary,
as we can do with a Lucene index, and integrate it into the process. Is that possible with the
fast version? Do you have any pointers that could help me do that?

Thank you very much for your time.

 Maïté Meseure Hugues