tika-dev mailing list archives

From "Mohsen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2724) Tika does not recognize http 3xx error codes when passed fileUrl
Date Wed, 05 Sep 2018 15:32:01 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohsen updated TIKA-2724:
-------------------------
    Description: 
When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status code, Tika happily
returns a 200 response.

*How to reproduce the issue*: Run the Tika server with the {{-enableUnsecureFeatures}} and
{{-enableFileUrl}} options. Then send the server a {{fileUrl}} that returns a 3xx status code.
Here is a sample curl session:
{code}
$ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9998 (#0)
> PUT /rmeta/text HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.54.0
> Accept: */*
> fileUrl:http://google.com
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 05 Sep 2018 15:25:12 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
<
* Connection #0 to host localhost left intact
[{"Content-Encoding":"UTF-8","Content-Type":"text/html; charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History | Settings | Sign in\n\n\n
\n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage tools\n\n\n\n\nGoogle offered in: Fran�ais
\n\n\nAdvertising�ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy
- Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
 

  was:
When the {{fileUrl}} passed to the Tika server results in a 3xx http status code, Tika happily
returns a 200 response.

*How to reproduce the issue*: Run tika server with {{-enableUnsecureFeatures and -enableFileUrl
options. Then send a }}{{fileUrl}} to the server that returns a 300 error code. Here is a
sample curl session:
{code:bash}
$ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9998 (#0)
> PUT /rmeta/text HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.54.0
> Accept: */*
> fileUrl:http://google.com
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 05 Sep 2018 15:25:12 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
<
* Connection #0 to host localhost left intact
[{"Content-Encoding":"UTF-8","Content-Type":"text/html; charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History | Settings | Sign in\n\n\n
\n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage tools\n\n\n\n\nGoogle offered in: Fran�ais
\n\n\nAdvertising�ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy
- Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
 


> Tika does not recognize http 3xx error codes when passed fileUrl
> ----------------------------------------------------------------
>
>                 Key: TIKA-2724
>                 URL: https://issues.apache.org/jira/browse/TIKA-2724
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.18
>            Reporter: Mohsen
>            Priority: Major
>
> When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status code, Tika
> happily returns a 200 response.
> *How to reproduce the issue*: Run the Tika server with the {{-enableUnsecureFeatures}} and
> {{-enableFileUrl}} options. Then send the server a {{fileUrl}} that returns a 3xx status code.
> Here is a sample curl session:
> {code}
> $ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
> * Trying ::1...
> * TCP_NODELAY set
> * Connected to localhost (::1) port 9998 (#0)
> > PUT /rmeta/text HTTP/1.1
> > Host: localhost:9998
> > User-Agent: curl/7.54.0
> > Accept: */*
> > fileUrl:http://google.com
> >
> < HTTP/1.1 200 OK
> < Content-Type: application/json
> < Date: Wed, 05 Sep 2018 15:25:12 GMT
> < Transfer-Encoding: chunked
> < Server: Jetty(8.y.z-SNAPSHOT)
> <
> * Connection #0 to host localhost left intact
> [{"Content-Encoding":"UTF-8","Content-Type":"text/html; charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History | Settings | Sign in\n\n\n
\n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage tools\n\n\n\n\nGoogle offered in: Fran�ais
\n\n\nAdvertising�ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy
- Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
>  
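
To check independently that the target of the fileUrl really answers with a redirect, one could query it directly with curl, which does not follow redirects unless -L is given. The following is only a minimal sketch: the helper names classify_status and fetch_status are hypothetical and are not part of Tika or curl, and the grouping of codes by first digit is a simplification.

```shell
#!/usr/bin/env bash

# Hypothetical helper: classify an HTTP status code by its first digit.
classify_status() {
  case "$1" in
    2??) echo "success" ;;
    3??) echo "redirect" ;;
    4??|5??) echo "error" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical helper: print only the status code returned for a URL.
# -s silences progress output, -o /dev/null discards the body, and
# -w '%{http_code}' writes the numeric status code. Redirects are not
# followed because -L is not passed.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Usage (requires network access):
#   classify_status "$(fetch_status http://google.com)"
```

If the URL is classified as "redirect" here while the Tika server still answers 200 for the same fileUrl, that would match the behaviour described in this issue.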



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
