nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2033) parse-tika skips valid documents.
Date Wed, 03 Jun 2015 19:14:38 GMT
Luis Lopez created NUTCH-2033:
---------------------------------

             Summary: parse-tika skips valid documents.
                 Key: NUTCH-2033
                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.10
            Reporter: Luis Lopez
             Fix For: 1.11


If we run:
bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription

we’ll get:

Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml

The same occurs for:
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json

Both are perfectly valid documents that would parse correctly if they were returned as
"application/xml" and "text/plain" respectively.

This happens because parse-tika uses the mime type to retrieve a suitable parser. Some composite
mime types are not included in the list of types it can map to a parser, even though the
documents are perfectly valid and parsable. On top of that, servers often return incorrect
mime types for the documents requested.

We created a helper class as a workaround for this issue. The class uses regular expressions
to define synonyms. In the first case, any mime type that matches "application/(.*)\+xml" is
replaced by "application/xml", so parse-tika can parse the document without problems.
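
To illustrate the idea, here is a minimal sketch of such a synonym helper. It is not the class
attached to this issue: the class name MimeTypeSynonyms and the methods addSynonym and resolve
are hypothetical, and it only shows the regex-to-replacement lookup described above, without
any integration into parse-tika.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; names are assumptions, not Nutch API.
final class MimeTypeSynonyms {

    // Rules are kept in insertion order so that the first matching rule wins.
    private final List<Map.Entry<Pattern, String>> rules = new ArrayList<>();

    // Registers a rule: any mime type fully matching 'regex' is replaced by 'replacement'.
    MimeTypeSynonyms addSynonym(String regex, String replacement) {
        rules.add(new AbstractMap.SimpleImmutableEntry<>(Pattern.compile(regex), replacement));
        return this;
    }

    // Returns the synonym for 'mimeType', or the input unchanged if no rule matches.
    String resolve(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        for (Map.Entry<Pattern, String> rule : rules) {
            if (rule.getKey().matcher(mimeType).matches()) {
                return rule.getValue();
            }
        }
        return mimeType;
    }
}
```

With the rule from the first case registered via addSynonym("application/(.*)\\+xml",
"application/xml"), resolve("application/opensearchdescription+xml") yields "application/xml",
while a type that matches no rule, such as "text/html", is passed through untouched. The
resolved value would then be used in place of the original mime type when looking up a parser.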

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
