tika-dev mailing list archives

From "Giuseppe Totaro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser
Date Wed, 03 Jun 2015 07:59:49 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1645:
----------------------------------
    Labels: patch  (was: )

> Extraction of biomedical information using CTAKESParser
> -------------------------------------------------------
>
>                 Key: TIKA-1645
>                 URL: https://issues.apache.org/jira/browse/TIKA-1645
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: patch
>         Attachments: TIKA-1645.patch
>
>
> As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler]
is preliminary work toward integrating [Apache cTAKES|http://ctakes.apache.org/] into
Tika, allowing users to extract biomedical information from clinical text.
> Essentially, this work includes a wrapper for CAS serializers that dump
the identified annotations into XML-based formats.
> The attached patch adds CTAKESParser, a new parser that
decorates AutoDetectParser and relies on a new version of CTAKESContentHandler, revised
based on feedback from [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser
generates the same output as AutoDetectParser and, in addition, metadata containing the
clinical annotations identified by cTAKES.
> To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need to install
the latest stable release of cTAKES (3.2.2), following the instructions in the [User Install Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
Then, you can launch Tika as follows:
> {noformat}
> CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
> java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
> {noformat}
> In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}},
which contains the configuration properties used to build the cTAKES AnalysisEngine; {{tika-config.xml}}
is a custom Tika configuration file that lists the MIME types that CTAKESParser will
parse.
> You can find attached an example of both {{CTAKESConfig.properties}} and {{tika-config.xml}}
for parsing ISA-Tab files with cTAKES.
> You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use the UMLS-based
components of cTAKES.
> I would really appreciate your feedback.
> Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this work.
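
The launch command quoted in the description can be sketched as a small shell helper. This is only an illustration, not part of the patch: the variable names TIKA_APP_JAR, CTAKES_CONFIG_DIR and TIKA_CONFIG and the three function names are hypothetical, and the default values simply repeat the placeholder paths from the description.

```shell
#!/bin/sh
# Sketch only: assembles the classpath described in the issue.
# All variable and function names below are assumptions made for illustration.

CTAKES_HOME="${CTAKES_HOME:-/usr/local/apache-ctakes-3.2.2}"
TIKA_APP_JAR="${TIKA_APP_JAR:-tika-app-1.10-SNAPSHOT.jar}"
CTAKES_CONFIG_DIR="${CTAKES_CONFIG_DIR:-/path/to/CTAKESConfig}"
TIKA_CONFIG="${TIKA_CONFIG:-/path/to/tika-config.xml}"

build_classpath() {
    # Same order as in the issue: Tika app jar, cTAKES desc, resources,
    # lib/* (a wildcard expanded by the JVM), and the config parent directory.
    printf '%s:%s/desc:%s/resources:%s/lib/*:%s\n' \
        "$TIKA_APP_JAR" "$CTAKES_HOME" "$CTAKES_HOME" "$CTAKES_HOME" \
        "$CTAKES_CONFIG_DIR"
}

check_ctakes_config() {
    # The issue states that the config directory is the parent of this file.
    [ -f "$CTAKES_CONFIG_DIR/org/apache/tika/parser/ctakes/CTAKESConfig.properties" ]
}

run_tika() {
    # "$1" is the input file. Quoting the classpath keeps the shell from
    # expanding lib/* before the JVM sees it.
    java -cp "$(build_classpath)" org.apache.tika.cli.TikaCLI \
        --config="$TIKA_CONFIG" "$1"
}
```

After sourcing the script, `check_ctakes_config && run_tika /path/to/input` would run the same command as in the description, failing early if the properties file is not where the classpath entry expects it.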



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
