nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1991) Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
Date Wed, 22 Apr 2015 20:49:59 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1991:
-----------------------------------
    Attachment: NUTCH-1991-trunk.v2.patch

Thanks, [~ilopata1]! Updated the patch to apply against trunk. Only the core change remains:
use mimeTypes.detect() instead of tika.detect(). Tested: tika-mimetypes.xml is loaded from
$NUTCH_HOME/conf/ if the property mime.types.file is set.
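
For reference, the property mentioned above could be set roughly as follows. This is a minimal sketch, assuming the override goes into conf/nutch-site.xml and that the value is a file name resolved from the conf/ directory; only the property name and the file name come from this message.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml (file placement is an assumption) -->
<configuration>
  <property>
    <!-- Property name as given in the comment above -->
    <name>mime.types.file</name>
    <!-- Assumed to be looked up in $NUTCH_HOME/conf/ -->
    <value>tika-mimetypes.xml</value>
  </property>
</configuration>
```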

> Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1991
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1991
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1, 1.10, 1.11, 2.3.1
>            Reporter: Iain Lopata
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-1991-1.6.patch, NUTCH-1991-trunk.v2.patch
>
>
> From Nutch version 1.5 onwards, the MimeUtil.java class, which acts as a facade to Tika
> for MIME type detection, attempts a match using the MIME type returned by the server,
> the file name, and the content. NUTCH-1045 provided for the use of an external
> tika-mimetypes.xml file that supplies the configuration for this process. However, the
> content-based detection did not use this file and instead fell back to the configuration
> bundled in the Tika library. Consequently, any content-based match rules added to the
> Nutch version of the configuration file were not used.
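
To illustrate what a content-based match rule in the Nutch copy of tika-mimetypes.xml might look like, here is a hedged sketch. The MIME type application/x-example, the magic string EXMPL, and the *.exmpl glob are all hypothetical and not taken from the issue. Before the fix described in this thread, a magic rule like this one would reportedly have been ignored during content-based detection.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a custom entry in conf/tika-mimetypes.xml; all values are hypothetical -->
<mime-info>
  <mime-type type="application/x-example">
    <!-- Content-based rule: match the bytes "EXMPL" at the start of the document -->
    <magic priority="50">
      <match value="EXMPL" type="string" offset="0"/>
    </magic>
    <!-- File-name-based rule, shown for contrast with the content-based rule -->
    <glob pattern="*.exmpl"/>
  </mime-type>
</mime-info>
```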



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
