uima-dev mailing list archives

From "Marshall Schor (JIRA)" <uima-...@incubator.apache.org>
Subject [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
Date Mon, 19 May 2008 15:47:58 GMT

    [ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597983#action_12597983 ]

Marshall Schor commented on UIMA-1033:

Requested Software Grant - awaiting confirmation of receipt of same before loading into SVN.

> ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
> ----------------------------------------------------------------------------------
>                 Key: UIMA-1033
>                 URL: https://issues.apache.org/jira/browse/UIMA-1033
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Sandbox
>         Environment: Java 5
>            Reporter: Michael Tanenblatt
>            Priority: Minor
>         Attachments: conceptMapper.zip, conceptMapper.zip.md5
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> ConceptMapper is a token-based dictionary lookup UIMA component. It was
> designed specifically to allow any external tokenizer that is a UIMA
> component to be used to tokenize its dictionary. Using the same tokenizer
> on both the dictionary and for subsequent text processing prevents
> situations where a particular dictionary entry is not found, though it
> exists, because it was tokenized differently than the text being processed.
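The tokenization mismatch described above can be illustrated with a minimal Python sketch. It is not ConceptMapper code: both tokenizers and the `contains` helper are hypothetical stand-ins, assumed only for this illustration.

```python
import re

def whitespace_tokenize(text):
    # Hypothetical tokenizer: splits on whitespace, punctuation stays attached.
    return text.lower().split()

def punct_tokenize(text):
    # Hypothetical tokenizer: words and punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def contains(tokens, entry_tokens):
    # True if entry_tokens occurs as a contiguous run inside tokens.
    n = len(entry_tokens)
    return any(tokens[i:i + n] == entry_tokens
               for i in range(len(tokens) - n + 1))

entry = "mesenteric fibromatosis (c48.1)"
text = "Findings suggest mesenteric fibromatosis (c48.1)."

# Same tokenizer for dictionary and text: the entry is found.
same = contains(punct_tokenize(text), punct_tokenize(entry))

# Different tokenizers: the entry is present in the text but is missed,
# because "(c48.1)" is one token on one side and several on the other.
mixed = contains(punct_tokenize(text), whitespace_tokenize(entry))
```

Here `same` evaluates to true and `mixed` to false, which is the situation that a shared tokenizer is meant to prevent.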
> ConceptMapper is highly configurable, in terms of:
>  * the way dictionary entries are mapped to resultant annotations
>  * the way input documents are processed
>  * the availability of multiple lookup strategies
>  * its various output options.
> Additionally, a set of post-processing filters is supplied, as well as an
> interface for easily creating new filters. This allows for overgenerating
> results during the lookup phase, if so desired, then reducing the result
> set according to particular rules.
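The overgenerate-then-filter idea might look roughly like the following Python sketch. The `Match` type, the `drop_nested` rule, and `apply_filters` are hypothetical names chosen for illustration; the actual filter interface shipped with the component is not described in this message.

```python
from typing import Callable, List, NamedTuple

class Match(NamedTuple):
    begin: int    # index of the first matched token
    end: int      # index one past the last matched token
    concept: str  # canonical form of the dictionary entry

# A filter is assumed here to be any function from matches to matches.
Filter = Callable[[List[Match]], List[Match]]

def drop_nested(matches: List[Match]) -> List[Match]:
    # Example rule: discard a match that lies inside a strictly longer one.
    kept = []
    for m in matches:
        nested = any(
            o.begin <= m.begin and m.end <= o.end
            and (o.end - o.begin) > (m.end - m.begin)
            for o in matches
        )
        if not nested:
            kept.append(m)
    return kept

def apply_filters(matches: List[Match], filters: List[Filter]) -> List[Match]:
    # Run each filter in order over the overgenerated result set.
    for f in filters:
        matches = f(matches)
    return matches

overgenerated = [
    Match(0, 2, "abdominal fibromatosis"),
    Match(1, 2, "fibromatosis"),
    Match(4, 5, "desmoid"),
]
result = apply_filters(overgenerated, [drop_nested])
```

With this input, the nested "fibromatosis" match is removed and the other two matches are kept.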
> More details:
> The structure of the dictionary itself is quite flexible. Entries can have
> any number of variants (synonyms), and arbitrary features can be associated
> with dictionary entries. Individual variants inherit features from the
> parent token (i.e., the canonical form), but can override them or add additional
> features. In the following sample dictionary entry, there are 5 variants of
> the canonical form, and as described earlier, each inherits the SemClass
> and POS attributes from the canonical form, with the exception of the
> variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
> SemClass attribute (this is somewhat of a contrived example, just to make
> that point):
> <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>    <variant base="abdominal fibromatosis" />
>    <variant base="abdominal desmoid" />
>    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
>    <variant base="mesenteric fibromatosis" />
>    <variant base="retroperitoneal fibromatosis" />
> </token>
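The inheritance and override behaviour of this entry can be sketched in Python with the standard `xml.etree.ElementTree` module. This only illustrates the rule as described in the issue; `resolve_variants` is a hypothetical helper, not the component's dictionary loader.

```python
import xml.etree.ElementTree as ET

DICTIONARY_ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
   <variant base="abdominal fibromatosis" />
   <variant base="abdominal desmoid" />
   <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
   <variant base="mesenteric fibromatosis" />
   <variant base="retroperitoneal fibromatosis" />
</token>
"""

def resolve_variants(token_xml):
    # Map each variant's base string to its effective feature set.
    token = ET.fromstring(token_xml.strip())
    inherited = dict(token.attrib)  # features of the canonical form
    resolved = {}
    for variant in token.findall("variant"):
        features = dict(inherited)
        # Attributes on the variant (other than "base") override or extend
        # the inherited ones.
        features.update(
            {k: v for k, v in variant.attrib.items() if k != "base"}
        )
        resolved[variant.get("base")] = features
    return resolved

variants = resolve_variants(DICTIONARY_ENTRY)
```

After this runs, `variants["abdominal desmoid"]["SemClass"]` is the inherited "Diagnosis", while the "(c48.1)" variant carries the overriding "Diagnosis-Site".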
> Input tokens are processed one span at a time, where both the token and
> span (usually a sentence) annotation types are configurable. Additionally,
> the particular feature of the token annotation to use for lookups can be
> specified; otherwise its covered text is used. Other input configuration
> settings are whether to use case-sensitive matching, an optional class name
> of a stemmer to apply to the tokens, and a list of stop words to ignore
> during lookup. One additional input control mechanism is the ability to
> skip tokens during lookups based on particular feature values. In this way,
> it is easy to skip, for example, all tokens with particular part of speech
> tags, or with some previously computed semantic class.
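A rough Python sketch of these input controls follows. Tokens are modelled as plain dictionaries, and `prepare_tokens` with all of its parameter names is hypothetical; the order in which case folding, stop word removal, and stemming are applied is also an assumption, since the message does not specify it.

```python
def prepare_tokens(tokens, lookup_feature="text", case_sensitive=False,
                   stemmer=None, stop_words=(), skip_features=None):
    # Return (token index, lookup key) pairs for tokens that take part in lookup.
    skip_features = skip_features or {}
    stop = set(stop_words) if case_sensitive else {w.lower() for w in stop_words}
    prepared = []
    for index, token in enumerate(tokens):
        # Skip tokens whose feature values are on the skip list (e.g. POS tags).
        if any(token.get(f) in values for f, values in skip_features.items()):
            continue
        # Use the configured feature, falling back to the covered text.
        key = token.get(lookup_feature, token["text"])
        if not case_sensitive:
            key = key.lower()
        if key in stop:
            continue
        if stemmer is not None:
            key = stemmer(key)
        prepared.append((index, key))
    return prepared

tokens = [
    {"text": "The", "pos": "DT"},
    {"text": "Abdominal", "pos": "JJ"},
    {"text": "desmoids", "pos": "NNS"},
    {"text": "recurred", "pos": "VBD"},
]

keys = prepare_tokens(
    tokens,
    stop_words=["the"],
    stemmer=lambda w: w[:-1] if w.endswith("s") else w,  # toy stemmer
    skip_features={"pos": {"VBD"}},
)
```

Only "Abdominal" and "desmoids" survive here, as the keys "abdominal" and "desmoid"; the stop word and the verb are excluded before lookup.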
> Output is in the form of new annotations, and the type of resulting
> annotations can be specified in a descriptor file. The mapping from
> dictionary entry attributes to the result annotation features can also be
> specified. Additionally, a string containing the matched text, a list of
> matched tokens, and the span enclosing the match can be specified to be set
> in the result annotations. It is also possible to indicate dictionary
> attributes to write back into each of the matched tokens.
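As a sketch of how such an output mapping could work, the Python below builds a result annotation as a plain dictionary. The feature names (`DictCanon`, `SemanticClass`, `matchedText`, and so on) and the `build_annotation` helper are hypothetical; the real names are whatever the descriptor file specifies.

```python
def build_annotation(entry, mapping, matched_tokens, enclosing_span,
                     write_back=()):
    # Create a result annotation from a matched dictionary entry.
    annotation = {
        "begin": matched_tokens[0]["begin"],
        "end": matched_tokens[-1]["end"],
        "matchedText": " ".join(t["text"] for t in matched_tokens),
        "matchedTokens": matched_tokens,
        "enclosingSpan": enclosing_span,
    }
    # Copy dictionary attributes into result features as configured.
    for attribute, feature in mapping.items():
        if attribute in entry:
            annotation[feature] = entry[attribute]
    # Optionally write selected dictionary attributes back onto each token.
    for attribute in write_back:
        for token in matched_tokens:
            token[attribute] = entry[attribute]
    return annotation

entry = {"canonical": "abdominal fibromatosis", "SemClass": "Diagnosis", "POS": "NN"}
mapping = {"canonical": "DictCanon", "SemClass": "SemanticClass"}
matched = [
    {"text": "abdominal", "begin": 4, "end": 13},
    {"text": "desmoid", "begin": 14, "end": 21},
]
sentence = {"begin": 0, "end": 30}

annotation = build_annotation(entry, mapping, matched, sentence,
                              write_back=["SemClass"])
```

The resulting annotation spans offsets 4 to 21, carries the canonical form under the mapped feature name, and each matched token now also holds the `SemClass` value.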
> Dictionary lookup is controlled by three parameters in the descriptor. One
> allows for order-independent lookup (i.e., A B == B A). Another toggles
> between finding only the longest match and finding all possible matches.
> The final parameter specifies the search strategy, of which there are
> three. The default search strategy only considers contiguous tokens
> (not including tokens from the stop word list or otherwise skipped tokens),
> and then begins the subsequent search after the longest match. The second
> strategy allows for ignoring non-matching tokens, allowing for disjoint
> matches, so that a dictionary entry of
>     A C
> would match against the text
>     A B C
> As with the default search strategy, the subsequent search begins after the
> longest match. The final search strategy is identical to the previous,
> except that subsequent searches begin one token ahead, instead of after the
> previous match. This enables overlapped matching.
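The three search strategies can be approximated by the following Python sketch, under several assumptions: the strategy labels ("contiguous", "skip", "overlapping") are made up for this example, only longest-match mode is modelled, order-independent lookup is left out, and the input tokens are assumed to have had stop words and skipped tokens removed already.

```python
def match_entry(entry, tokens, start, allow_gaps):
    # Return the token indices matching entry from start, or None.
    if tokens[start] != entry[0]:
        return None
    indices = [start]
    pos = start + 1
    for word in entry[1:]:
        if allow_gaps:
            # Ignore non-matching tokens until the next entry word appears.
            while pos < len(tokens) and tokens[pos] != word:
                pos += 1
        if pos >= len(tokens) or tokens[pos] != word:
            return None
        indices.append(pos)
        pos += 1
    return indices

def lookup(tokens, entries, strategy="contiguous"):
    # Longest-match lookup over one span using the given search strategy.
    allow_gaps = strategy != "contiguous"
    results = []
    start = 0
    while start < len(tokens):
        candidates = []
        for entry in entries:
            indices = match_entry(entry, tokens, start, allow_gaps)
            if indices:
                candidates.append((entry, indices))
        if not candidates:
            start += 1
            continue
        # Keep the entry with the most tokens (the longest match).
        entry, indices = max(candidates, key=lambda c: len(c[0]))
        results.append((entry, indices))
        if strategy == "overlapping":
            start += 1                # resume one token ahead
        else:
            start = indices[-1] + 1   # resume after the match
    return results

text = ["A", "B", "C"]
entries = [("A", "C"), ("B", "C")]

contiguous = lookup(text, entries, "contiguous")
skipping = lookup(text, entries, "skip")
overlapping = lookup(text, entries, "overlapping")
```

For the text A B C, the contiguous strategy finds only B C, the skipping strategy finds only the disjoint A C and then resumes after it, and the overlapping strategy finds both A C and B C.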

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
