nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
Date Sun, 26 Apr 2015 03:35:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512858#comment-14512858 ]

Chris A. Mattmann commented on NUTCH-1994:
------------------------------------------

Hey [~jorgelbg], I thought it was NUTCH-1991, but that appears to be a red herring. The failure
first appeared with the commit of NUTCH-1994. I have been working on this all day to figure out
whether NUTCH-1991 was the cause, and it seems that it wasn't.

I'm down to this error in parse-zip (excuse my System.out.printlns):

{noformat}
2015-04-25 20:31:54,378 INFO  conf.Configuration (Configuration.java:getConfResourceAsInputStream(1017)) - found resource parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,408 INFO  conf.Configuration (Configuration.java:getConfResourceAsInputStream(1017)) - found resource parse-plugins.xml at file:/Users/mattmann/src/nutch/conf/parse-plugins.xml
2015-04-25 20:31:54,414 INFO  parse.ParserFactory (ParserFactory.java:matchExtensions(376)) - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it  in the parse-plugins.xml file
PARSER RETRIEVED! NULL!
2015-04-25 20:31:54,473 ERROR tika.TikaParser (TikaParser.java:getParse(86)) - Can't retrieve Tika parser for mime-type text/plain
RESULT TEXT! textfile.txt  
HERE IS THE PARSE TEXT textfile.txt  
{noformat}

So it looks like getParse in TikaParser.java can't retrieve the Tika parser for text/plain
(the zip file in the sample directory for parse-zip contains a single text file, textfile.txt,
which contains the expected text). Since the appropriate Tika parser can't be retrieved, the
zip parser extracts only the file name and not the file's text, which is why the test is failing.
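
To make that failure mode concrete, here is a minimal, self-contained sketch. It is not the
actual parse-zip or TikaParser code: the class ZipFallbackSketch, its methods, and the
map-based parser registry are all hypothetical stand-ins that use only the JDK. It shows how a
null result from the parser lookup leaves just the entry name in the extracted text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipFallbackSketch {

    // Hypothetical stand-in for MIME detection: only recognizes ".txt".
    static String guessMimeType(String entryName) {
        return entryName.endsWith(".txt") ? "text/plain" : "application/octet-stream";
    }

    // Walks the zip entries. The entry name is always appended; the entry's
    // text is appended only if a parser is registered for its MIME type.
    static String extract(byte[] zipBytes, Map<String, Function<byte[], String>> parsers) {
        StringBuilder text = new StringBuilder();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                text.append(entry.getName()).append(' ');
                byte[] body = readEntry(zin);
                Function<byte[], String> parser = parsers.get(guessMimeType(entry.getName()));
                if (parser == null) {
                    // Analogous to "Can't retrieve Tika parser for mime-type text/plain":
                    // nothing beyond the file name is added.
                    continue;
                }
                text.append(parser.apply(body)).append(' ');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    // Reads the remaining bytes of the current zip entry.
    static byte[] readEntry(ZipInputStream zin) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = zin.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Builds an in-memory zip with a single entry, mirroring the shape of the test sample.
    static byte[] sampleZip(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip("textfile.txt", "expected text");

        // No parser registered for text/plain: only the file name comes back.
        System.out.println(extract(zip, new HashMap<>()));

        // With a text/plain parser registered, the file's text is included too.
        Map<String, Function<byte[], String>> parsers = new HashMap<>();
        parsers.put("text/plain", b -> new String(b, StandardCharsets.UTF_8));
        System.out.println(extract(zip, parsers));
    }
}
```

The first call in main corresponds to the situation in the log above, where the lookup
returned null; the second shows what the test presumably expects once the lookup succeeds.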

I'm still trying to figure out why it can't find the Tika parser for text/plain with Tika 1.8.

> Upgrade to Apache Tika 1.8
> --------------------------
>
>                 Key: NUTCH-1994
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1994
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build, parser
>    Affects Versions: 1.10, 2.3.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.10, 2.3.1
>
>         Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch
>
>
> Tika 1.8 was released this morning.
> Let's upgrade, then release Nutch trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
