nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2033) parse-tika skips valid documents.
Date Tue, 09 Jun 2015 00:10:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Lopez updated NUTCH-2033:
------------------------------
    Description: 
If we run:
{code}
bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
{code}

we’ll get:

{code}
Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
{code}

The same occurs for:
{code}
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
{code}

Both are perfectly valid documents that would parse if they were returned as "application/xml"
and "text/plain" respectively.

This happens because parse-tika uses the mime type to retrieve a suitable parser, and some
composite mime types are not included in its list of supported types even though the documents
are perfectly valid and parsable. This does not even take into account that servers often
return incorrect mime types for the documents requested.

We created a helper class as a workaround for this issue. The class uses regular expressions
to define synonyms. In the first case, any mime type that matches "application/(.*)\+xml" is
replaced by "application/xml", so parse-tika can parse the document just fine.
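
To illustrate the synonym idea, here is a minimal sketch. It is not the helper class attached
to this issue: the class name MimeTypeSynonyms, the method normalize and the rule table are
hypothetical, and only the "application/(.*)\+xml" pattern is taken from the description above.
It assumes the mime type is normalized before parse-tika looks up a parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of a mime-type synonym table; names are illustrative only. */
final class MimeTypeSynonyms {

  // Ordered rules: the first pattern that matches the whole mime type wins.
  private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

  static {
    // Pattern from the issue description: any "+xml" composite type maps to plain XML.
    RULES.put(Pattern.compile("application/(.*)\\+xml"), "application/xml");
  }

  /** Returns the synonym for a mime type, or the input unchanged if no rule matches. */
  static String normalize(String mimeType) {
    if (mimeType == null) {
      return null;
    }
    for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
      // matches() requires the entire string to match, not just a substring.
      if (rule.getKey().matcher(mimeType).matches()) {
        return rule.getValue();
      }
    }
    return mimeType;
  }

  public static void main(String[] args) {
    // prints "application/xml"
    System.out.println(normalize("application/opensearchdescription+xml"));
  }
}
```

Types that match no rule, such as "text/html", pass through untouched, so the lookup for
already supported mime types is not affected.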



  was:
If we run:
```
bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription```

we’ll get:

Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml

the same occurs  for:
{code}
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
{code}

Both perfectly valid documents if they were returned as "application/xml" and "text/plain"
respectively. 

This happens because parse-tika uses the mime type to retrieve a suitable parser, some composite
mime types are not included in this list even though they are perfectly valid and parsable
documents. This not taking into account that servers often return incorrect mime types for
the documents requested.

We created a helper class as a workaround for this issue. The class uses regex expressions
to define synonyms. In the first case any mime type that matches "application/(.*)\+xml" will
be replaced by "application/xml". This way parse-tika will parse the document just fine.




> parse-tika skips valid documents.
> ---------------------------------
>
>                 Key: NUTCH-2033
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: mime-type, parse-tika, parser, tika
>             Fix For: 1.11



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
