tika-dev mailing list archives

From "Thamme Gowda (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2360) Handle SentimentParser resource failure more robustly
Date Wed, 17 May 2017 03:43:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013472#comment-16013472 ]

Thamme Gowda commented on TIKA-2360:

Sorry, I am late to the discussion.
1. (y) to turning it OFF. I never intended for it to be on by default. (The code was carried
over from the previously abandoned PR, and I did not think it could become a problem.)

2. About downloading models: we can configure the Maven build to download the models on the
first run. The downloaded files would reside in src/test/resources. I see no need to
include the models in src/main/resources, since this feature is off by default and the models
increase the final jar size. Whoever turns on the sentiment analysis feature should then
configure it manually (Step 1: wget the model; Step 2: set its path in the XML file).
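
As a rough illustration of the "off unless manually configured" behaviour described in point 2, here is a minimal sketch. The class name ModelLocator and the method locate are hypothetical and not part of Tika; the sketch assumes the configured path arrives as a plain string read from the XML file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper (not a Tika class): resolves a manually configured model path. */
public class ModelLocator {

    /**
     * Returns the model path only when one was explicitly configured and the
     * file exists locally. It never falls back to a network download.
     */
    static Optional<Path> locate(String configuredPath) {
        if (configuredPath == null || configuredPath.trim().isEmpty()) {
            return Optional.empty(); // nothing configured: the feature stays off
        }
        Path path = Paths.get(configuredPath.trim());
        // A configured but missing file also leaves the feature off instead of failing.
        return Files.isRegularFile(path) ? Optional.of(path) : Optional.empty();
    }

    public static void main(String[] args) {
        String configured = args.length > 0 ? args[0] : null;
        System.out.println(locate(configured).isPresent()
                ? "sentiment analysis enabled"
                : "sentiment analysis disabled");
    }
}
```

With this shape, a parser could skip sentiment analysis whenever locate returns an empty Optional, so a default build makes no network calls.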

3. Regarding parser or not:
I agree with [~tallison@mitre.org].
We have two kinds of recognisers, so we need two parsers.
First: NER, SentimentAnalysis, AgePredictor, or any other text/NLP classifier. INPUT: text;
OUTPUT: a set of metadata key-value pairs.
Second: ObjectRecogniser, VideoLabeller, OCR, Caption. INPUT: raw bytes; OUTPUT: a set of
metadata key-value pairs.
My suggestion: let us create two generic parsers. The first extracts text and the second
does not. All the machine learning (ML) actions can be seen as add-ons to these two
parsers, and configuration can enable or disable each add-on.
The ML features whose input we can hold in memory (such as extracted text) can be add-ons
to the generic parser. That way we can call many add-ons in line per read-parse-extract
call and merge all the metadata.
The ML features whose input we cannot hold in memory (such as a large video) can be
independent parsers; we would let each one stream the raw content directly on its own.
WDYT about this approach?
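
To make the add-on idea in point 3 concrete, here is a minimal sketch of the text-extracting generic parser. Every name in it (AddOnSketch, TextAddOn, GenericTextParser, ToySentiment, WordCount) is hypothetical and none of it is existing Tika API. Text extraction is reduced to UTF-8 decoding, and the "models" are toy stand-ins.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddOnSketch {

    /** Hypothetical add-on contract: text in, metadata key-value pairs out. */
    interface TextAddOn {
        String name();
        Map<String, String> analyze(String text);
    }

    /** Toy stand-in for a sentiment model; a real add-on would load a trained model. */
    static class ToySentiment implements TextAddOn {
        public String name() { return "sentiment"; }
        public Map<String, String> analyze(String text) {
            Map<String, String> out = new LinkedHashMap<>();
            out.put("label", text.toLowerCase().contains("good") ? "positive" : "neutral");
            return out;
        }
    }

    /** Second toy add-on, present only to show several add-ons sharing one extraction. */
    static class WordCount implements TextAddOn {
        public String name() { return "wordcount"; }
        public Map<String, String> analyze(String text) {
            Map<String, String> out = new LinkedHashMap<>();
            String trimmed = text.trim();
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            out.put("words", Integer.toString(words));
            return out;
        }
    }

    /** Hypothetical generic parser: extracts text once, then runs every enabled add-on. */
    static class GenericTextParser {
        private final List<TextAddOn> addOns = new ArrayList<>();

        GenericTextParser enable(TextAddOn addOn) {
            addOns.add(addOn);
            return this;
        }

        Map<String, String> parse(byte[] raw) {
            // Assumption: real text extraction is replaced by plain UTF-8 decoding.
            String text = new String(raw, StandardCharsets.UTF_8);
            Map<String, String> metadata = new LinkedHashMap<>();
            for (TextAddOn addOn : addOns) {
                // Prefix keys with the add-on name so merged results cannot collide.
                for (Map.Entry<String, String> e : addOn.analyze(text).entrySet()) {
                    metadata.put(addOn.name() + ":" + e.getKey(), e.getValue());
                }
            }
            return metadata;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new GenericTextParser()
                .enable(new ToySentiment())
                .enable(new WordCount())
                .parse("good day".getBytes(StandardCharsets.UTF_8));
        System.out.println(metadata);
    }
}
```

The second kind of recogniser (raw bytes in, metadata out) would not fit this shape, since its input cannot be held in memory; under the proposal it would remain an independent parser that streams the content itself.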

> Handle SentimentParser resource failure more robustly
> -----------------------------------------------------
>                 Key: TIKA-2360
>                 URL: https://issues.apache.org/jira/browse/TIKA-2360
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Blocker
>             Fix For: 1.15
> The SentimentParser tests currently require a network call to GitHub.  For those working
> behind a proxy, or who would prefer Tika not to make unexpected network calls, can we
> please turn this off by default?

This message was sent by Atlassian JIRA
